wenjiedu / awesome_imputation

Awesome Deep Learning for Time-Series Imputation, including a must-read paper list about applying neural networks to impute incomplete time series containing NaN missing values/data

License: BSD 3-Clause "New" or "Revised" License

Python 55.26% Shell 2.01% Jupyter Notebook 42.73%
benchmark data-mining deep-learning imputation machine-learning missing-data missing-values neural-network probablistic survey

awesome_imputation's Issues

About comparison fairness and dataset splitting

Dear Authors,

Thank you for your invaluable contributions to this repository. I am currently exploring the field of time series imputation and have encountered some aspects regarding the evaluation protocols that I believe could benefit from further discussion.

  1. Dataset Splitting: Splitting the dataset chronologically is well suited to time-series forecasting, where it prevents data leakage. For imputation, however, the goal is to fill in missing values within the data that is already available, so such a split may not be necessary. A non-chronological split might be more appropriate, since it reflects real-world scenarios where all available data is subject to imputation, not only the most recent portion.
  2. Evaluation Comparisons: The evaluation process raises some questions about fairness and consistency across methods. Take the Transformer and the mean imputer as an example. The Transformer is assessed on the test data, but it is unclear how the mean imputer should be evaluated. Should the mean imputer also have access to the test data, given that the non-missing values in the test data would also be available at model-serving time? There are two options:
  • Training the mean imputer on the train-eval set is unfair, since the non-missing values in the test set should be available to the mean imputer too; using them does not cause leakage, and nn models already exploit them.
  • Training the mean imputer exclusively on the test set fails to leverage the potentially informative train-eval sets, which seems equally unfair.
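
To make the options concrete, here is a minimal NumPy sketch. It is not the repository's evaluation code: the data is synthetic, and the helper names (`column_means`, `mean_impute`, `masked_mae`) as well as the 80/20 chronological cut and the 20% masking rate are assumptions chosen for illustration. It fits a mean imputer on (A) the train-eval set only, (B) the observed test values only, and (C) both together, then scores each on the same artificially hidden test entries.

```python
import numpy as np

def column_means(x):
    """Per-feature mean over observed (non-NaN) entries."""
    return np.nanmean(x, axis=0)

def mean_impute(x, means):
    """Fill NaNs in x with the given per-feature means; observed values are kept."""
    return np.where(np.isnan(x), means, x)

def masked_mae(imputed, truth, eval_mask):
    """Mean absolute error over the artificially hidden entries only."""
    return np.abs(imputed - truth)[eval_mask].mean()

rng = np.random.default_rng(0)

# Synthetic data (assumption): feature 0 drifts over time, feature 1 is stationary.
t = np.arange(200, dtype=float)
full = np.stack([0.05 * t + rng.normal(size=200), rng.normal(size=200)], axis=1)

# Chronological cut (assumption): first 80% is train-eval, last 20% is test.
train_eval, test_truth = full[:160], full[160:]

# Hide 20% of the test entries; only these are scored.
eval_mask = rng.random(test_truth.shape) < 0.2
test_observed = test_truth.copy()
test_observed[eval_mask] = np.nan

candidates = {
    "A: train-eval only": column_means(train_eval),
    "B: observed test only": column_means(test_observed),
    "C: train-eval + observed test": column_means(
        np.concatenate([train_eval, test_observed], axis=0)
    ),
}

results = {
    name: masked_mae(mean_impute(test_observed, means), test_truth, eval_mask)
    for name, means in candidates.items()
}

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.3f}")
```

None of the three variants ever sees the hidden test entries, so the comparison differs only in which observed values the imputer is allowed to use.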

In view of these points, I suggest the following:

  • For generalized imputation methods like those in HyperImpute, should we withhold only the missing values in the test set and treat all remaining data as usable (including the non-missing values in the test data, as well as the train and eval data)?
  • Could we use a non-chronological train-val-test split, given that in practical applications the emphasis is on imputing the entire dataset rather than only the most recent months? More importantly, in the case of missing value imputation, the non-missing data is often unavailable (kindly see the protocol of HyperImpute for reference).
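
As a rough illustration of the two suggestions, the sketch below shows a split helper that can be switched between chronological and shuffled order, and a helper that hides a random fraction of observed entries for scoring. Both function names (`split_indices`, `hold_out_observed`), the 70/10/20 ratios, and the 10% hold-out rate are hypothetical; this is neither the repository's nor HyperImpute's implementation. It also assumes the samples are non-overlapping windows, since shuffling overlapping windows would let neighbouring samples share time steps across splits.

```python
import numpy as np

def split_indices(n_samples, ratios=(0.7, 0.1, 0.2), chronological=True, seed=0):
    """Return (train, val, test) index arrays.

    With chronological=False the sample order is shuffled before cutting,
    so every split draws from the whole time range.
    """
    idx = np.arange(n_samples)
    if not chronological:
        idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def hold_out_observed(x, rate=0.1, seed=0):
    """Hide a random fraction of the observed entries of x.

    Returns the corrupted copy and a boolean mask of the hidden entries,
    which are the only positions where ground truth exists for scoring.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(x)
    mask = observed & (rng.random(x.shape) < rate)
    corrupted = x.copy()
    corrupted[mask] = np.nan
    return corrupted, mask

# Usage: shuffled split over 100 samples, then hide 10% of observed values.
train_idx, val_idx, test_idx = split_indices(100, chronological=False)

data = np.random.default_rng(1).normal(size=(100, 24, 3))  # samples x steps x features
data[data > 1.5] = np.nan                                   # naturally missing values
test_corrupted, test_mask = hold_out_observed(data[test_idx])
```

Under this protocol, everything except the entries flagged in `test_mask` stays visible to every imputer, whether it is a neural network or a simple statistical baseline.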

I look forward to your insights and any suggestions you might have on aligning the evaluation framework with real-world imputation tasks.

Best regards,

Hao
