
seq2pat's People

Contributors

bkleyn, brianwarner, dorukkilitcioglu, nagireddyakshay, skadio, takojunior, wddcheng


seq2pat's Issues

[QUESTION] Any way to limit the minimum length of each pattern?

Thanks for this helpful work.
I wonder if there is any way to limit the minimum length of each pattern.
Referring to the Dichotomic Pattern Mining paper, I added an order attribute (the sequential order within a sequence, e.g. [A, B, C, D] has the attribute [0, 1, 2, 3]) and added the constraint 5 <= att_order.span() <= 10 to limit the length to between 5 and 10. However, I found that it does not work.

Can you share how to deal with the above issue?
Thank you very much.
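One possible workaround (a sketch, not a seq2pat feature) is to post-filter the mined results by item count, assuming the usual output format where each pattern is a list of items followed by its trailing frequency count:

```python
# Sketch of a length post-filter on mined patterns. The pattern lists
# below are placeholders; each pattern is assumed to be items plus a
# trailing frequency count, so len(p) - 1 is the number of items.
patterns = [["A", "B", "C", "D", "E", 7], ["A", "B", 12]]

min_len, max_len = 5, 10
filtered = [p for p in patterns if min_len <= len(p) - 1 <= max_len]
print(filtered)  # only the 5-item pattern survives
```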

Is it suitable for a classification framework?

Great code and paper, thanks! My only question: is it suitable for a classification framework, i.e., a dichotomy of positive vs. negative outcomes in populations? For example, out of 1000 sequences:

200 belong to class YES
and
800 belong to class NO

Then:
1. find patterns from YES that occur in YES but not in NO;
2. find patterns from NO that occur in NO but not in YES.
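The dichotomy described above can be approximated by mining each cohort separately and taking set differences of the results. A minimal sketch with placeholder pattern sets standing in for per-cohort mining output:

```python
# Placeholder pattern sets standing in for mining results on the YES and
# NO cohorts (frequency counts dropped, item lists kept as tuples).
yes_patterns = {("A", "B"), ("A", "C"), ("B", "D")}
no_patterns = {("A", "C"), ("C", "D")}

unique_to_yes = yes_patterns - no_patterns  # in YES but not in NO
unique_to_no = no_patterns - yes_patterns   # in NO but not in YES
print(sorted(unique_to_yes), sorted(unique_to_no))
```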

A strange problem: too many mining results cause Jupyter to crash

I encountered a strange problem while using the library. I have a total of 1541 sequences, and the items in each sequence have time attributes (in milliseconds). When mining, I added span and gap constraints. When I reduce the constraint, the mining result is 0. When I increase the constraint a little, my Jupyter kernel hangs. This puzzles me. I guess Jupyter crashed because there were too many results, causing a memory overflow, but I only increased the constraint a little, so logically the number of results should not grow that much.

Please tell me how to solve this problem, thank you very much!

Windows pip installs on 32bit but is missing modules

I'd suggest adding a note that 64-bit Python is needed. Not sure if it's needed on Linux, but at least on Windows the pip install succeeded, yet I got this error until I switched to a 64-bit Python install:
ModuleNotFoundError: No module named 'sequential.pat2feat'

Inquiry on performance-improvement in context of small number of long sequences

Thank you for the great repository!

Seq2Pat Experimental/Results summary indicates that the "batch_size" parameter can lead to performance improvement for a large number of sequences (e.g., 100k to 1M).
seq2pat experimental/results summary

However, in the context of a smaller number of data points (hundreds or thousands) but long sequences (30,000 to 50,000 items), my guess is that "batch_size" may not be as helpful.

In this case, would increasing the "n_jobs" parameter be beneficial?

[Question] - Sequence of transaction basket

Hi,

thank you for your work.
Is your software able to handle sequence formed like the one below?

seq2pat = Seq2Pat(sequences=[[["A1", "A2", "B"], ["A", "D"]],
                             [["C", "B"], ["A"]],
                             [["C", "A"], ["C", "D"]]])

So, for instance, the first sequence is a tuple of 3 elements - A1, A2, B - followed by a tuple of 2 elements - A, D.

Labels not included in encodings output

Hi. This library is very intriguing! I've been able to install and use the library with some of my own data. I've been able to replicate the example outputs from your example notebooks. So far, so good.

When I look at the output created in the encodings (i.e., the output from a line such as encodings = pat2feat.get_features(sequences, dpm_patterns, drop_pattern_frequency=False)), I see the sequence with all of the respective one-hot columns for the pattern features identified. But here is where I'm stuck: this output doesn't include the original label of each individual example. How do I add the label to this output for later machine learning tasks? Do I need to subsequently join this output back to my original dataset on the sequence column? Or is there a way to "carry forward" the respective labels throughout the DPM processing into this final output?

Thanks for a great library- Tim
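One way to re-attach labels, assuming the encodings keep the original sequence column, is a pandas merge against the source data. The column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical encodings output: sequence column plus one-hot feature columns.
encodings = pd.DataFrame({"sequence": [("A", "B"), ("C", "D")],
                          "feature_0": [1, 0]})

# Hypothetical original dataset with labels, keyed by the same sequences.
labels = pd.DataFrame({"sequence": [("A", "B"), ("C", "D")],
                       "label": ["YES", "NO"]})

# Join the labels back onto the encodings on the sequence column.
merged = encodings.merge(labels, on="sequence", how="left")
print(merged["label"].tolist())  # ['YES', 'NO']
```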

Request for metadata on the event_time column

Hi there!

Thank you for the amazing library and for sharing it openly!

I gave it a try, running the code cell by cell from the dichotomic_pattern_mining.ipynb file using the same dataset from Requena (2020). Referring to the sample_data.csv used in the mentioned notebook, would you mind elaborating on the metadata of the event_time column, for example its unit (is it in seconds, milliseconds, or some other form)?

You also mentioned that the event_time column (as well as the other columns) is extracted from the original dataset. I am assuming that the event_time column is constructed from the server_timestamp_epoch_ms column, which is also available in the original dataset. If so, would you mind sharing how event_time was constructed?

Looking forward to hearing from you. Thanks!

Changes to not allow arcs to skip layers

Hi! Thank you for open-source and sharing the work.

I have a slightly different task that does not allow skipping items during pattern mining (i.e., not allowing arcs to skip layers). What I want is for the mined patterns to consist of consecutive items in the sequences. Which part of the code should I change to enforce layer-by-layer mining?

Thank you!
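Until such an option exists, one workaround (a sketch, not part of the seq2pat API or codebase) is to mine as usual and then keep only patterns that occur as a contiguous run in at least one sequence:

```python
# Check whether a pattern's items appear as consecutive items in a sequence.
def occurs_contiguously(pattern, sequence):
    n = len(pattern)
    return any(list(sequence[i:i + n]) == list(pattern)
               for i in range(len(sequence) - n + 1))

print(occurs_contiguously(["A", "B"], ["C", "A", "B", "D"]))  # True
print(occurs_contiguously(["A", "D"], ["C", "A", "B", "D"]))  # False
```

This is a post-hoc filter, so it does not prevent the solver from exploring skipping arcs; it only discards non-contiguous patterns afterwards.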

FEATURE REQUEST

I would like to thank you for this work; it is very helpful.

I have a request/question concerning output generation. Currently, patterns are generated with their support (count) on the right-hand side, e.g. [['A', 'B', 3], ['C', 'D', 'A', 6]], where 3 and 6 represent the support (count) of each generated pattern respectively.

Is there a way to report that support (count) as a percentage instead of a raw integer?

Or, the best option would be: if the entered support is a percentage, the output support should be a percentage, and if the entered support is a raw integer, the output support should be a raw integer.

Thank you.
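For illustration, converting the returned counts into percentages can also be done as a post-processing step, assuming the items-plus-trailing-count output format shown above:

```python
# Placeholder mined patterns: items followed by a raw support count.
patterns = [["A", "B", 3], ["C", "D", "A", 6]]
num_sequences = 10  # hypothetical number of input sequences

# Replace the trailing count with a percentage of all sequences.
as_percent = [p[:-1] + [100 * p[-1] / num_sequences] for p in patterns]
print(as_percent)  # [['A', 'B', 30.0], ['C', 'D', 'A', 60.0]]
```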

Memory Leak in C++

Hello,
the module does not free the memory it reserves. Deleting the Seq2Pat object in Python does not change anything.
Running the following code will use more and more memory as long as the loop runs.

for i in range(2000000):
    seq2pat = Seq2Pat(sequences=[["a", "a", "a"], ["a", "b", "c"], ["a", "b", "b"],
                                 ["a", "a", "a"], ["a", "b", "c"], ["a", "b", "b"],
                                 ["a", "a", "a"], ["a", "b", "c"], ["a", "b", "b"]])
    traces = seq2pat.get_patterns(min_frequency=5)

Runtime error: if __name__ == '__main__'

Hello. I'm receiving the following error when I attempt to run the Sequential Pattern Mining Average Constraint example found here:

https://github.com/fidelity/seq2pat/blob/master/notebooks/sequential_pattern_mining.ipynb

Below is the error. I have installed both Cython and the compiler and I am running Python 3.7

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Installation error

Hi, I am getting the following error while trying to install the package:

LINK : fatal error LNK1158: cannot run 'rc.exe'
C:\Users..\AppData\Local\Temp\pip-build-env-0ce14mib\overlay\Lib\site-packages\setuptools\command\build_py.py:202: SetuptoolsDeprecationWarning: Installing 'sequential.backend' as data is deprecated, please list it in packages.
!!

  ############################
  # Package would be ignored #
  ############################
  Python recognizes 'sequential.backend' as an importable package,
  but it is not listed in the `packages` configuration of setuptools.

  'sequential.backend' has been automatically added to the distribution only
  because it may contain data files, but this behavior is likely to change
  in future versions of setuptools (and therefore is considered deprecated).

  Please make sure that 'sequential.backend' is included as a package by using
  the `packages` configuration field or the proper discovery methods
  (for example by using `find_namespace_packages(...)`/`find_namespace:`
  instead of `find_packages(...)`/`find:`).

  You can read more about "package discovery" and "data files" on setuptools
  documentation page.
check.warn(importable)

error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe' failed with exit code 1158
ERROR: Failed building wheel for seq2pat
Failed to build seq2pat
ERROR: Could not build wheels for seq2pat which use PEP 517 and cannot be installed directly

QUESTION

Thank you very much for the nice work.

I have a question(suggestion/feature request if not present).

Assume we have the following dataset as presented in this repo:

[["A", "A", "B", "A", "D"],
["C", "B", "A"],
["C", "A", "C", "D"]]
with the following attribute (time-attribute)

[[1, 1, 2, 3, 3],
[3, 8, 9],
[2, 5, 5, 7]].

  1. Is there a possibility of restricting patterns so that consecutive items differ? For example, ignore a pattern that would result in [A, A] or [C, C], i.e. item(i) != item(i+1).

  2. Is there a possibility of restricting generated patterns to those that start with a certain item? For example,
    generate a pattern only if it starts with A, B, or C.

  3. Is there a possibility of restricting patterns to end with a certain value? For example,
    generate a pattern only if it ends with one of the items C, D, or A.

Can you share how to deal with the above scenarios?

Thank you.
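Absent built-in support, all three restrictions could be applied as post-filters on the mined patterns. A sketch, assuming the usual items-plus-trailing-count output format and placeholder results:

```python
# Placeholder mined patterns: items followed by a frequency count.
patterns = [["A", "A", 3], ["A", "B", 2], ["C", "A", "C", 2]]

def items(p):
    return p[:-1]  # drop the trailing count

# 1. No two consecutive items equal
no_repeats = [p for p in patterns
              if all(a != b for a, b in zip(items(p), items(p)[1:]))]

# 2. Must start with one of the given items
starts_ok = [p for p in patterns if items(p)[0] in {"A", "B"}]

# 3. Must end with one of the given items
ends_ok = [p for p in patterns if items(p)[-1] in {"C", "D"}]
```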

Integer sequences containing zero (0) as an event

Firstly, thank you for sharing your work.

I found out that using 0 as an event in a sequence of integers is problematic. The output may be incorrect (empty) or a segmentation fault may occur.

from sequential.seq2pat import Seq2Pat

# "0" as a string works well
seq2pat = Seq2Pat(sequences=[['0', '0'], ['0', '0']])
patterns = seq2pat.get_patterns(min_frequency=2)
print(patterns)  # Outputs [['0', '0', 2]]

# No error but incorrect output
seq2pat = Seq2Pat(sequences=[[1, 0], [1, 0]])
patterns = seq2pat.get_patterns(min_frequency=2)
print(patterns)  # Outputs []. Expecting [[1, 0, 2]].

# Produces a segmentation fault
seq2pat = Seq2Pat(sequences=[[0, 0], [0, 0]])
patterns = seq2pat.get_patterns(min_frequency=2)
print(patterns)

It could be helpful to document this issue in the README.md until a fix is produced, either by fixing the underlying cause, converting sequences to strings, or preprocessing integer sequences to rename 0 events.
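Either suggested preprocessing step is straightforward; for example, stringifying integer events (which the snippet above shows works correctly):

```python
# Workaround sketch: convert integer events to strings so that 0 never
# appears as an integer event in the sequences passed to the miner.
sequences = [[1, 0], [1, 0]]
as_strings = [[str(event) for event in seq] for seq in sequences]
print(as_strings)  # [['1', '0'], ['1', '0']]
# Mine on as_strings, then map items back with int() if needed.
```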

Review Issue, @TimKam

Dear all,

let us use this issue to discuss my review (you can branch off smaller issues if you prefer).
You find the initial review below.

Best,
Timotheus


The contribution introduces a Python-based sequence-to-pattern generation library that is based on/re-uses recent research results in the domain.

All in all, the contribution seems to be on the way to a good state, with the following limitations:

  1. Contribution and authorship: The main author and generally 3 of 5 authors have made substantial contributions to the software. The most senior "not contributing author" seems to be a leading expert on the topic. All in all, this looks like a reasonable list of authors (given common academic practice), even though not all authors have contributed code/docs. (No action needed)

  2. Substantial scholarly effort: in essence, the library is a wrapper around research code that has been written for the corresponding AAAI 2019 paper. The authors should be more clear about this fact. (The paper is prominently linked; for sure, the authors did not aim to disguise this fact.) This does not mean that the contribution is not valuable, but it certainly has implications on the additional effort that was necessary to create the library. Here, I'd like to get the perspective of an editor.

  3. Automated tests: It would be good to configure a CI for the repository. Even for review purposes this would have been neat. No documentation on how to run the tests exists. Navigating to tests and running pytest test_seq2pat.py executes 40 tests, at least some of which have multiple assertions. All tests seem to be managed in one file (1244 LOC), which is not according to good practices. At least unit tests should roughly reflect the project structure.

  4. Examples and API documentation: Very simple examples are provided and the API is documented.

    4.1. However, the examples are largely abstract (the concepts of "items" and "price" are used in the examples, but that's about it) and do not illustrate the practical problems the library can potentially solve. This is very unfortunate, because for a non-expert reader it is very hard to understand the practical benefit of the library. I strongly recommend adding examples that relate to a potential application domain, even if these examples are merely toy examples.

    Nit-picks:

    4.2. In the docs, it seems a bit odd to me that the API doc has the heading "Seq2Pat Public API". Public vs. private 1) does not exist in Python and 2) is, in the context of APIs, more commonly used in commercial software/software-as-a-service scenarios.

    4.3. Currently, the repository contains the sources of the docs and the Sphinx build result. Using a free-of-charge service like Read the Docs for automatically pulling the sources from GitHub and building them is more elegant.

  5. Community guidelines: Community guidelines for contributing do not exist.

  6. Statement of need: a statement of need exists in the paper, not in the docs. The paper could benefit from a less abstract perspective on the practical usefulness of the library (same as with the examples, see above). Considering that the software is co-developed by some sort of investment firm, providing a practical motivation should be possible.

  7. Summary: see above, an introduction for non-experts is missing.

  8. State of the field: the authors provide an overview of the state of the field, but do not motivate why they create a new library instead of improving existing libraries.

  9. The authors make good use of references. However, in the introduction, some references that point to high-level overviews, applied research papers and/or surveys would be nice.

Patterns with a single event

Hello there!

Single-event patterns seem to be problematic. Below are two examples depicting the issue:

from sequential.seq2pat import Seq2Pat

seq2pat = Seq2Pat(sequences=[["A"], ["A"], ["A"]])
seq_patterns = seq2pat.get_patterns(min_frequency=2)
print(seq_patterns)  # Outputs []. Expected [["A", 3]]

seq2pat = Seq2Pat(sequences=[["A"], ["A", "B"], ["A"], ["A", "B"]])
seq_patterns = seq2pat.get_patterns(min_frequency=2)
print(seq_patterns)  # Outputs [["A", "B", 2]]. Expected [["A", 4], ["A", "B", 2]]

Thank you.

Performance warning when using pat2feat.get_features

Thank you for this wonderful library; it works really well so far. When using pat2feat.get_features to extract features for many patterns, I get lots of

/sequential/pat2feat.py:79: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df['feature_' + str(i)] = df.apply(lambda row: is_satisfiable_in_rolling(row['sequence'], pattern,

Would it be possible to generate the column data first and then bulk-create the DataFrame by concatenating, instead of adding columns one by one? Would it also be possible for pat2feat to return a NumPy array directly, as pandas is often overkill?

#Sequences: 108
#Patterns: 4300

pandas==2.0.2
seq2pat==1.4.0
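The suggested fix can be illustrated independently of seq2pat: collect all feature columns first, then attach them in a single concat call (the column names and values below are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"sequence": [["A", "B"], ["B", "C"]]})

# Build every feature column up front, then concatenate once, avoiding
# the fragmentation caused by repeated frame.insert() calls.
feature_cols = {f"feature_{i}": [1, 0] for i in range(3)}
df = pd.concat([df, pd.DataFrame(feature_cols, index=df.index)], axis=1)
print(list(df.columns))  # ['sequence', 'feature_0', 'feature_1', 'feature_2']
```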

Is the dataset still available to download?

Is the dataset from "Seq2Pat: Sequence-to-Pattern Generation for Constraint-Based Sequential Pattern Mining" still available to download? If yes, what is the link?

Clickstream data: the dataset contains rich clickstream behavior of online users browsing a popular fashion e-commerce website (Requena et al. 2020).

Feature Request: get attributes associated with mined pattern

First of all, let me thank you for this amazing library. The pattern mining works like a charm.

Right now I am looking for a way to use the patterns in the context of the original sequences. The easiest option, IMHO, would be to see in which sequences a mined pattern occurred.

As far as I understand the algorithms, this information is stored in the MDD, alongside the attributes for each item in the sequence, which I am also interested in.
Would it be possible for you to include an option that returns this information with the mined patterns, i.e. returns a pattern in the following form:

[{'item': 'A', 'time': [1, 1, 1, 2, 3]}, {'item': 'B', 'time': [2, 3, 2, 4, 5]}, [1, 2, 5, 8, 12]]

Here, in addition to the items of a pattern, the associated attribute list is attached (in this case only the 'time' attribute).
And instead of (or possibly in addition to) the count of occurrences of the pattern, a list with the indices of the input sequences is attached. In this case the pattern occurred in sequences 1, 2, 5, 8 and 12.

This is only a suggestion for the output format; feel free to adapt it to the current internal data structure. I am not sure how complex it is to add this information. I think, for starters, the sequence IDs would be really helpful.

Please let me know if this is something you would include in the library.
