cool-cohort / cool
the source code of the COOL system
Home Page: https://www.comp.nus.edu.sg/~dbsystem/cool/
License: Apache License 2.0
In table.yaml, columns of float64 type should be processed with the same logic as int, but this is not yet supported.
We want to generate cohort queries from the sogamo dataset for the cohortQueryProcessing unit test.
Through some simple data analysis, we found the following problems:
In the sogamo dataset, there are only 4 players across the entire dataset of 10k items, so the cohort query in the old-version code is not representative and does not work well as a unit test. According to the CoHANA paper, the raw data is larger than the sample data we currently have. I recommend using the raw data to generate the test cohort query.
The tpch dataset has the same problem: there is only 1 user in the entire dataset, and all orders in this dataset belong to that same user.
ENV: MacBook Pro, macOS 11.6.5
Steps on Terminal:
Currently, the COOL system cannot support NULL values in INT-type columns, and it assigns NULL as a separate class for String-type columns.
Maybe we can add bitmaps to deal with NULL values.
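One possible shape for the bitmap idea, as a minimal sketch (`NullableIntColumn` and its methods are hypothetical names, not COOL's actual API): a `BitSet` alongside the int column marks which rows are NULL, so no int value has to be sacrificed as a sentinel.

```java
import java.util.BitSet;

// Sketch only: an int column paired with a null bitmap.
// Bit i set => row i is NULL; the value slot then holds a don't-care filler.
public class NullableIntColumn {
  private final int[] values;
  private final BitSet nulls = new BitSet();

  public NullableIntColumn(int capacity) {
    this.values = new int[capacity];
  }

  public void set(int row, Integer v) {
    if (v == null) {
      nulls.set(row);   // mark NULL in the bitmap
      values[row] = 0;  // filler; never read while the bit is set
    } else {
      nulls.clear(row);
      values[row] = v;
    }
  }

  public boolean isNull(int row) {
    return nulls.get(row);
  }

  public int get(int row) {
    if (nulls.get(row)) {
      throw new IllegalStateException("row " + row + " is NULL");
    }
    return values[row];
  }
}
```

The bitmap costs one bit per row and compresses well when NULLs are rare.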
There is redundancy in the storage for the several features that have a one-to-one mapping with the user ID, e.g., age, gender, birth year, and other demographic information.
This system can save cohort results in a certain format, and we need to reload these cohort results to filter users when processing other cohort queries.
The CoolModel class currently serves to keep various readstores in memory without loading the actual chunk data. It looks like a caching layer, but there are a few issues to address.
It keeps a single currentCube, so it can serve the cohorts of only one cube at a time. This is not necessary: multiple workers should be able to concurrently request cohorts under different cubes with only the necessary synchronization.
Use checkstyle to enforce a consistent code formatting.
After executing the command with query1-0.json to obtain the specific users from the health dataset, COOL reports 8513 users, but the correct number of users should be 8592. The user ids from "P-19709" onward are all missing. The missing user ids are as follows.
P-19709, P-19712, P-19713, P-19718, P-19720, P-19721, P-19732, P-19749, P-19762, P-19767,
P-19772, P-19773, P-19776, P-19777, P-19778, P-19780, P-19782, P-19789, P-19790, P-19791,
P-19797, P-19798, P-19799, P-19801, P-19803, P-19805, P-19806, P-19809, P-19812, P-19813,
P-19820, P-19830, P-19834, P-19841, P-19842, P-19843, P-19844, P-19846, P-19860, P-19861,
P-19864, P-19872, P-19874, P-19879, P-19882, P-19887, P-19890, P-19893, P-19901, P-19902,
P-19907, P-19909, P-19910, P-19911, P-19915, P-19919, P-19923, P-19929, P-19931, P-19941,
P-19944, P-19948, P-19950, P-19952, P-19954, P-19959, P-19962, P-19964, P-19972, P-19973,
P-19975, P-19978, P-19983, P-19985, P-19989, P-19991, P-19992, P-19994, P-19998
If we use int32 to store the time (i.e., the time interval), we can only cover about 68 years, because INT32 ranges over [-2147483648, 2147483647].
For more details, please check the Year 2038 problem.
Next, we need to support an int64 storage format.
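A quick illustration of the range limit (helper names are hypothetical): with seconds stored in an int32, the representable interval tops out at roughly 68 years, while a long removes that ceiling.

```java
// Sketch: how many whole years of seconds fit in a given maximum value.
public class TimeRange {
  static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60;

  public static long maxYearsForSeconds(long maxValue) {
    return maxValue / SECONDS_PER_YEAR;
  }
}
```

`maxYearsForSeconds(Integer.MAX_VALUE)` comes out to 68, which is where the Year 2038 limit for epoch-second counters comes from.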
The output-cohort logic has at least one bug: users are repeatedly added to the cohort output.
The cohort size recorded in query_res.json is inconsistent with the user-list size stored in all.cohort. The issue is due to missing de-duplication when adding users during cohort processing.
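A minimal sketch of the missing de-duplication (class and method names are illustrative, not the actual COOL code): collecting users into a LinkedHashSet makes repeated adds idempotent while preserving insertion order for the output file.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch: a cohort user collection where duplicate adds are no-ops.
public class CohortUsers {
  private final LinkedHashSet<String> users = new LinkedHashSet<>();

  public void addUser(String userId) {
    users.add(userId); // adding the same id twice keeps a single entry
  }

  public int size() {
    return users.size();
  }

  public List<String> toList() {
    return new ArrayList<>(users);
  }
}
```

With this, the size written to query_res.json and the list written to all.cohort are derived from the same de-duplicated collection and cannot diverge.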
COOL can support the following OLAP query:
SELECT count(*), sum(O_TOTALPRICE)
FROM TPC_H
WHERE O_ORDERPRIORITY = '2-HIGH' AND R_NAME = 'EUROPE'
GROUP BY N_NAME, R_NAME
HAVING O_ORDERDATE >= '1993-01-01' AND O_ORDERDATE <= '1994-01-01'
The design of the OLAP processing follows these steps (see the logic visualization figure).
The result:
{
"status" : "OK",
"elapsed" : 0,
"result" : [ {
"timeRange" : "1993-01-01|1994-01-01",
"key" : "RUSSIA|EUROPE",
"fieldName" : "O_TOTALPRICE",
"aggregatorResult" : {
"count" : 2.0,
"sum" : 312855,
"average" : null,
"max" : null,
"min" : null,
"countDistinct" : null
}
}, {
"timeRange" : "1993-01-01|1994-01-01",
"key" : "GERMANY|EUROPE",
"fieldName" : "O_TOTALPRICE",
"aggregatorResult" : {
"count" : 1.0,
"sum" : 4820,
"average" : null,
"max" : null,
"min" : null,
"countDistinct" : null
}
}, {
"timeRange" : "1993-01-01|1994-01-01",
"key" : "ROMANIA|EUROPE",
"fieldName" : "O_TOTALPRICE",
"aggregatorResult" : {
"count" : 2.0,
"sum" : 190137,
"average" : null,
"max" : null,
"min" : null,
"countDistinct" : null
}
}, {
"timeRange" : "1993-01-01|1994-01-01",
"key" : "UNITED KINGDOM|EUROPE",
"fieldName" : "O_TOTALPRICE",
"aggregatorResult" : {
"count" : 1.0,
"sum" : 33248,
"average" : null,
"max" : null,
"min" : null,
"countDistinct" : null
}
} ]
}
Previously, each range field had a min and a max value in the DataField. Now, since we merge the invariant fields with the userID, we have lost that information.
There are mainly two types of cohort study: the prospective cohort study and the retrospective cohort study.
In a prospective cohort study, cohort analysis is used to study the results of some actions, while in a retrospective cohort study, cohort analysis is needed to explore the causes of some outcome.
The big difference between them is the definition of age: the age in a prospective cohort study is a period of time after the birth time, whereas the age in a retrospective cohort study is a period of time before the birth time.
COOL currently supports the prospective cohort study. We need to implement the retrospective cohort study as well.
Map strings to global ids at the beginning of each cublet.
Iterate through the data and check whether the user is contained in the target cohort list for the cublet.
Comment out CoolTupleReader for now; the downstream exploration application is already commented out.
The code format of PR #105 has not been checked.
The next step is to clean up all outdated code from the previous version and format the remaining code.
In eCommerce use-cases, we always need to group users based on several events (i.e., selectors), and the COOL system should support this functionality.
Currently, the implementation of DistributedController at L86 and L185 calls the reload method of CoolModel that accepts a buffer and loads only that buffer into CubeRS. Therefore, it cannot handle cubes with multiple cublets.
Moreover, this overloaded reload is only used here, and it does not make much sense to reload just one cublet. We can remove that method later.
The current organization of storevector seems quite ad hoc. The input vector is not a good abstraction: it defines a set of interface methods (get, hasNext, next) that are not general enough to be used or implemented by its implementing classes, and that work directly on the state of internal buffers in those classes.
A few ideas about my plan here: implementing the Iterable interface and using an iterator to avoid directly manipulating buffer state is, I think, a better approach. Any comments?
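A sketch of the Iterable direction (IntInputVector is a hypothetical name, not an existing COOL class): the traversal state lives in the iterator rather than in the vector's internal buffer, so each traversal starts fresh and cannot corrupt another.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch: an input vector exposed through Iterable; callers never touch
// the internal read offset, and each iterator() call starts a new pass.
public class IntInputVector implements Iterable<Integer> {
  private final int[] data;

  public IntInputVector(int[] data) {
    this.data = data;
  }

  @Override
  public Iterator<Integer> iterator() {
    return new Iterator<>() {
      private int pos = 0; // traversal state lives here, not in the vector

      @Override
      public boolean hasNext() {
        return pos < data.length;
      }

      @Override
      public Integer next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return data[pos++];
      }
    };
  }
}
```

Callers then use the enhanced for loop, and repeated traversals of the same vector are well-defined by construction.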
After merging PR #72, we shall revisit how to store a cohort; maybe there is a better approach than using a list of global ids for subsequent processing.
Supporting append relaxes the assumption of monotonically increasing user keys in a cube, which breaks all analyses when a cohort input is given. The problem is that, during processing, the input vector of the cohort is checked from small to large user-key ids only once, with no retraction.
The cohort representation is also related to PR #75. We no longer plan to maintain a consistent global id across cublets, so we cannot use a list of ids from a cube to represent a cohort.
One possible direction is to store the user key values directly and translate them through the metachunk of a cublet before processing, applying this also to the appended blocks.
As a continued effort of code refactoring to improve the system's extensibility, it has come to our attention that we need to remove the system design built around just int and string. Much code in the system is hardcoded and contains casts that support only int and string.
Removing this will pave the way to supporting float.
Have an abstract value type that moves across the different operators in COOL.
Have dedicated types at the source and sink to handle the different data types.
This yields better code readability and extensibility.
One tuple may match two events, but currently we only consider whether the tuple matches one of the events.
We need to support the list functions in the cohortselector layout, or we need to run multiple times to get meaningful results.
According to the current code logic, a user with only one record will never be classified into any cohort. Please fix it.
In the readStore, if we traverse one field multiple times, it reads the wrong value and does not throw exceptions or warnings.
For example, in the RLEInputVector readStore, if we traverse it multiple times with get, the end offset prevents the traversal and only returns the value at the last offset, which results in a wrong read.
We should either explicitly throw an exception to prevent multiple traversals, or allow them by removing the constraint or adding a reset method that resets the offset.
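A sketch of the reset-method option (RleReader is illustrative, not the real RLEInputVector): an explicit reset() rewinds the run and offset state, so a second traversal starts from the first element instead of silently returning the last value.

```java
// Sketch: a run-length-decoded reader with an explicit reset().
public class RleReader {
  private final int[] values;  // value of each run
  private final int[] lengths; // length of each run
  private int run = 0;         // current run index
  private int withinRun = 0;   // position inside the current run

  public RleReader(int[] values, int[] lengths) {
    this.values = values;
    this.lengths = lengths;
  }

  public boolean hasNext() {
    return run < values.length;
  }

  public int next() {
    int v = values[run];
    if (++withinRun == lengths[run]) {
      run++;          // advance to the next run
      withinRun = 0;
    }
    return v;
  }

  /** Rewind to the first element so the vector can be traversed again. */
  public void reset() {
    run = 0;
    withinRun = 0;
  }
}
```

The alternative (throwing on a second traversal) is the same check in hasNext() raising an IllegalStateException once a traversal has completed.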
In this version, the MetaChunk is continuously expanded across Cubelets. However, the MetaChunk in each Cubelet should contain only the information necessary for its own storage, and we can add separate storage for the global information.
For example,
In Cubelet 1, the age range is 0 to 20. (in MetaChunk: 0-20)
In Cubelet 2, the age range is 30 to 50. (in MetaChunk: 0-50)
we should change it into
In Cubelet 1, the age range is 0 to 20. (in MetaChunk: 0-20)
In Cubelet 2, the age range is 30 to 50. (in MetaChunk: 30-50)
(in MetaChunk: 0-50 [global] )
In this way, we can skip several cubelets by checking the MetaChunk when we process cohort queries. For example, we can skip the search process in Cubelet 1 if the query is only about the users whose age is bigger than 30.
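The cublet-skipping check above can be sketched as follows (names are hypothetical, not COOL's actual MetaChunk API): a cublet is prunable whenever its local [min, max] range cannot overlap the query's range.

```java
// Sketch: prune a cublet using its per-cublet MetaChunk range.
public class MetaChunkPruning {
  /**
   * True when the cublet's local value range [cubletMin, cubletMax]
   * cannot overlap the query predicate range [queryMin, queryMax],
   * so the cublet can be skipped entirely.
   */
  public static boolean canSkip(int cubletMin, int cubletMax,
                                int queryMin, int queryMax) {
    return cubletMax < queryMin || cubletMin > queryMax;
  }
}
```

For the example above (age > 30), Cubelet 1 with local range 0-20 is skipped, while Cubelet 2 with local range 30-50 is scanned.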
We can support time in the "%d/%m/%Y" format, and we need to support a more fine-grained format, i.e., "%d/%m/%Y %H:%M:%S".
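A sketch of parsing the finer-grained format with java.time (the class name is illustrative); the pattern below is the java.time equivalent of "%d/%m/%Y %H:%M:%S".

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: parse second-granularity timestamps in dd/MM/yyyy HH:mm:ss form.
public class FineGrainedTime {
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss");

  public static LocalDateTime parse(String s) {
    return LocalDateTime.parse(s, FMT);
  }
}
```

DateTimeFormatter is immutable and thread-safe, so the single static instance can be shared across loader threads.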
Please update the function and unit test for the funnel analysis
TODO
When we run the mvn clean package command, it generates many cubes and cohorts.
We should remove these useless files after all test cases finish.
Many thanks for the great work!
I found some limitations of the tool that prevent us from using it in its current state:
We need to add more use cases to validate and improve our system.
The current float compression fails with large diffs. The reason is that the XOR used to represent the difference can span all 32 bits; when the XOR value is directly cast to long, its first bit is recognized as the sign bit and is sign-extended into the 64-bit long. That results in diffSize being 64 when it should have been 32.
The solution is to use Integer.toUnsignedLong to convert the int to long. Note that casting in the other direction is fine: casting a value from long to int takes the last 32 bits as the bits of the int.
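The sign-extension pitfall can be demonstrated directly (helper names are illustrative, not the actual compression code): a plain (long) cast of an int with the top bit set fills the upper 32 bits with ones, while Integer.toUnsignedLong zero-extends it.

```java
// Sketch: bit-width of an XOR diff under the two int-to-long conversions.
public class XorWidening {
  /** Number of significant bits in a non-negative long value. */
  public static int bitsNeeded(long diff) {
    return 64 - Long.numberOfLeadingZeros(diff);
  }

  public static int diffSizeWithCast(int xor) {
    return bitsNeeded((long) xor);                  // sign-extends: up to 64
  }

  public static int diffSizeUnsigned(int xor) {
    return bitsNeeded(Integer.toUnsignedLong(xor)); // zero-extends: <= 32
  }
}
```

For an XOR like 0x80000001 (top bit set), the cast version reports 64 significant bits while the unsigned conversion correctly reports 32.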
We need to refactor all functionality tests in the current version.
Referring to the CoHANA paper, we can break down a complete cohort query into five sub-queries, shown in the figure. The target of the query is:
Given the launch birth action and the activity table as shown in the table (check the raw paper), for players who play the dwarf role at their birth time, cohort those players based on their birth countries and report the total gold that each country's launch cohort has spent since the players were born.
We call these five steps: 1) BirthSelection, 2) BirthTupleGeneration, 3) AgeGeneration, 4) CohortAttributeGeneration, 5) CohortGeneration.
Actually, we can compress step 1 and step 2 into one operation; the combination of the two is called BirthSelection. As for step 4, it is not necessary for a basic cohort query: it only adds additional attributes to a kind of cohort. For the example query above, the target is to count the number of players per country, which is the cohort. From the example in the paper "Cohort Analysis with Ease", we can see that CohortAttributeGeneration is not a basic step of a cohort query. As shown in the figure, steps a, b, and c generate the data on the x-axis and y-axis, and step e generates the lines of different colors.
Thus we can summarize the three steps of the basic cohort query, which are also the three operators in our implementation.
These three steps should be executed one after another in a CohortExecuteUnit.
(The following part is for cohort query processing only. We can consider it a component of the whole query processing module. If we want to maintain a good open-source project, we will have to support base query processing in the future.)
There are some assumptions to state before we build the query execution part.
For one table, we allow an append operation to add data. The data added in each append operation is called an append block. Considering the application scenarios of cohort analysis, we assume that every append block contains a batch of behavior data items within a certain time window. For the layout of the append block, we also assume that tuples are clustered by userId and that, for one user, the tuples are sorted in time order. These are the two constraints we set on the input data.
First, we create a class CohortOperator to combine the three sub-processes above plus the CohortAttributeGeneration step (which can be null). For the three sub-processes, we can create separate classes. There is no need for them to implement the same interface, since all three classes are needed in cohort processing; however, every class has to implement two functions, init and process, where init loads the filters from the query.
To briefly introduce the logic of each sub-process:
The logic of BirthSelection's process: return these HashMaps.
The logic of AgeGeneration's process: return an ArrayList containing the accessible tuples. This ArrayList is the scope to which the remaining operators are applied; each element of the ArrayList is an array of accessible tuples, and the index into the ArrayList is the age in the cohort.
The logic of CohortGeneration's process: we maintain a HashMap<Cohort, SelectedValueList>.
To reduce memory usage, we can implement the CohortOperator as an iterator container: every time step is called, it returns the i-th age's cohort data (a HashMap).
The content above is a simple sketch of the modular cohort query execution process. I think it can solve the continuity problem. Additionally, apart from cohort queries, we should refactor the filter classes in the next few days. Certainly, there are many other aspects to consider at the code level, such as memory usage and query optimization, which are left as future work.
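The three-operator design above can be sketched as follows (all names are illustrative; they are not COOL's actual classes): each sub-process exposes init to load its part of the query and process to transform one batch, and the operator chains BirthSelection, AgeGeneration, and CohortGeneration into one pipeline.

```java
import java.util.function.Function;

// Sketch: a generic sub-process contract plus a three-stage composition.
public class CohortPipeline {
  public interface SubProcess<I, O> {
    void init(String querySpec); // load filters/config from the query
    O process(I input);          // transform one batch of tuples
  }

  /** Compose the three stages into a single end-to-end function. */
  public static <A, B, C, D> Function<A, D> chain(SubProcess<A, B> birth,
                                                  SubProcess<B, C> age,
                                                  SubProcess<C, D> cohort) {
    return in -> cohort.process(age.process(birth.process(in)));
  }
}
```

The composed function mirrors the "executed one after another in a CohortExecuteUnit" requirement while keeping each stage independently testable.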
Following test passed.
Csv data loader Test
Cube Reload Test
Cohort Selection Test
Cohort Analysis Test
Funnel Analysis Test
IceBerg Test
CohortProfilingTest Test
Below test build failed.
======================== Cube List Test ========================
Applications: [sogamo, tpc-h-10g, health]
Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.245 sec <<< FAILURE!
CohortSelectionTest(com.nus.cool.core.model.CoolModelTest) Time elapsed: 0.029 sec <<< FAILURE!
java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:699)
at java.base/java.nio.DirectShortBufferU.get(DirectShortBufferU.java:321)
at com.nus.cool.core.io.storevector.ZInt16Store.next(ZInt16Store.java:67)
at com.nus.cool.core.io.storevector.FoRInputVector.next(FoRInputVector.java:57)
at com.nus.cool.core.cohort.TimeUtils.getDateFromOffset(TimeUtils.java:53)
at com.nus.cool.core.cohort.ExtendedCohortSelection.getUserBirthTime(ExtendedCohortSelection.java:427)
at com.nus.cool.core.cohort.ExtendedCohortSelection.selectUser(ExtendedCohortSelection.java:511)
at com.nus.cool.core.cohort.CohortUserSection.process(CohortUserSection.java:150)
at com.nus.cool.model.CoolCohortEngine.selectCohortUsers(CoolCohortEngine.java:63)
at com.nus.cool.core.model.CoolModelTest.CohortSelectionTest(CoolModelTest.java:96)
Results :
Failed tests: CohortSelectionTest(com.nus.cool.core.model.CoolModelTest)
Tests run: 13, Failures: 1, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for cool 0.1-SNAPSHOT:
[INFO]
[INFO] cool ............................................... SUCCESS [02:26 min]
[INFO] cool-core .......................................... FAILURE [02:30 min]
[INFO] cool-extensions .................................... SKIPPED
[INFO] parquet-extensions ................................. SKIPPED
[INFO] hdfs-extensions .................................... SKIPPED
[INFO] arrow-extensions ................................... SKIPPED
[INFO] avro-extensions .................................... SKIPPED
[INFO] cool-queryserver ................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:57 min
[INFO] Finished at: 2022-04-18T15:58:23+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project cool-core: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/i0w00to/apache/COOL/cool-core/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :cool-core
Create an issue and submit a corresponding PR for each one.
loadattr mutates internal data.
The EcommerceDataSetSQL1 test under IcebergQueryTest assumes that CsvLoaderTest has already loaded the ecommerce data into the cube repo, yet this dependency, unlike CohortSelectionTest/CohortAnalysisTest, is not specified.
The field sequence in table.yaml must be the same as the sequence in each row.
This is not convenient: the data reader could read the sequence from the header of the CSV file, and then use only the other information in table.yaml instead of the field sequence.
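A sketch of deriving the field order from the CSV header (CsvHeader is a hypothetical helper, and the parsing ignores quoted commas for brevity): table.yaml would then only need to carry per-field metadata such as types.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: map each field name in the CSV header line to its column index.
public class CsvHeader {
  public static Map<String, Integer> fieldIndex(String headerLine) {
    Map<String, Integer> index = new HashMap<>();
    String[] names = headerLine.split(",");
    for (int i = 0; i < names.length; i++) {
      index.put(names[i].trim(), i);
    }
    return index;
  }
}
```

The loader would look up each table.yaml field name in this map instead of assuming the yaml order matches the row layout.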
This is the version of the JDK:
java -version
java version "18.0.1" 2022-04-19
Java(TM) SE Runtime Environment (build 18.0.1+10-24)
Java HotSpot(TM) 64-Bit Server VM (build 18.0.1+10-24, mixed mode, sharing)
After pulling and running a clean build:
l0p018z@m-c02c75femd6r COOL % java -jar cool-queryserver/target/cool-queryserver-0.1-SNAPSHOT.jar datasetSource 8086
[] Dataset source folder datasetSource exists.
[] Start the Query Server (port: 8086)...
log4j:WARN No appenders could be found for logger (org.eclipse.jetty.util.log).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "Thread-0" java.lang.ExceptionInInitializerError
at com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:83)
at com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:176)
at com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282)
at com.sun.xml.bind.v2.runtime.property.ArrayProperty.(ArrayProperty.java:69)
at com.sun.xml.bind.v2.runtime.property.ArrayERProperty.(ArrayERProperty.java:88)
at com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:100)
at com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62)
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:67)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:483)
at com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
at com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347)
at com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170)
at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145)
at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:236)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:577)
at javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:171)
at javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:129)
at javax.xml.bind.ContextFinder.find(ContextFinder.java:307)
at javax.xml.bind.JAXBContext.newInstance(JAXBContext.java:478)
at javax.xml.bind.JAXBContext.newInstance(JAXBContext.java:435)
at org.glassfish.jersey.server.wadl.internal.WadlApplicationContextImpl.getJAXBContextFromWadlGenerator(WadlApplicationContextImpl.java:120)
at org.glassfish.jersey.server.wadl.WadlFeature.isJaxB(WadlFeature.java:74)
at org.glassfish.jersey.server.wadl.WadlFeature.configure(WadlFeature.java:56)
at org.glassfish.jersey.model.internal.CommonConfig.configureFeatures(CommonConfig.java:728)
at org.glassfish.jersey.model.internal.CommonConfig.configureMetaProviders(CommonConfig.java:647)
at org.glassfish.jersey.server.ResourceConfig.configureMetaProviders(ResourceConfig.java:823)
at org.glassfish.jersey.server.ApplicationHandler.initialize(ApplicationHandler.java:328)
at org.glassfish.jersey.server.ApplicationHandler.lambda$initialize$1(ApplicationHandler.java:293)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.processWithException(Errors.java:232)
at org.glassfish.jersey.server.ApplicationHandler.initialize(ApplicationHandler.java:292)
at org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:259)
at org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:246)
at org.glassfish.jersey.jetty.JettyHttpContainer.(JettyHttpContainer.java:468)
at org.glassfish.jersey.jetty.JettyHttpContainerProvider.createContainer(JettyHttpContainerProvider.java:37)
at org.glassfish.jersey.server.ContainerFactory.createContainer(ContainerFactory.java:58)
at com.nus.cool.queryserver.QueryServer.createJettyServer(QueryServer.java:94)
at com.nus.cool.queryserver.QueryServer.run(QueryServer.java:49)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make protected final java.lang.Class java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int) throws java.lang.ClassFormatError accessible: module java.base does not "opens java.lang" to unnamed module @15fef0a0
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Method.checkCanSetAccessible(Method.java:200)
at java.base/java.lang.reflect.Method.setAccessible(Method.java:194)
at com.sun.xml.bind.v2.runtime.reflect.opt.Injector$1.run(Injector.java:177)
at com.sun.xml.bind.v2.runtime.reflect.opt.Injector$1.run(Injector.java:174)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
at com.sun.xml.bind.v2.runtime.reflect.opt.Injector.(Injector.java:172)
... 44 more
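The InaccessibleObjectException comes from the old com.sun.xml.bind JAXB runtime doing reflective access that JDK 16+ forbids by default. A common workaround (shown below as a sketch; alternatives are running on JDK 8/11 or upgrading to the Jakarta JAXB artifacts) is to open java.base/java.lang to the unnamed module at launch:

```shell
# Workaround: allow the bundled JAXB runtime its reflective access on JDK 16+.
java --add-opens java.base/java.lang=ALL-UNNAMED \
     -jar cool-queryserver/target/cool-queryserver-0.1-SNAPSHOT.jar datasetSource 8086
```

This only papers over the module restriction; the longer-term fix is updating the JAXB dependency.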
In OLAP queries, such as GROUP BY and COUNT, we don't care about the order of records, so this assumption adds some inconvenience to data pre-processing for OLAP queries.
Once a user has only one record, that user will not be counted as a valid user.
COOL stores and compresses the DateTime column as int values, so it can only process time at day granularity. If we update the storage to the float type, it can process time steps of an hour, a minute, or even a second.
The current age calculation for week/month/year is based on translation to a fixed day interval.
It needs to use date objects to generate the difference directly, and we should add tests to cover these cases.
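A sketch of calendar-correct age differences with java.time (the class name is illustrative), instead of translating to fixed day counts such as month = 30 days or year = 365 days:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Sketch: compute month/year ages directly from date objects,
// so leap years and variable month lengths are handled correctly.
public class CalendarAge {
  public static long ageInMonths(LocalDate birth, LocalDate event) {
    return ChronoUnit.MONTHS.between(birth, event);
  }

  public static long ageInYears(LocalDate birth, LocalDate event) {
    return ChronoUnit.YEARS.between(birth, event);
  }
}
```

ChronoUnit truncates to whole units, which is the usual semantics for cohort ages (a user is "age 20" until the 21st anniversary of birth).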