Code Monkey home page Code Monkey logo

Comments (6)

chenyangkang avatar chenyangkang commented on June 8, 2024

Hi Chris,

Thank you for your valuable comments. I agree on most of the issues you mentioned. Here are the replies and solutions for them:

  1. About how time and space are encoded:

Do input dataframes require a DOY column?

No, the target spatial and temporal indexing columns are flexible, and could be passed as arguments.

Does anything change if data are provided at other temporal scales (weekly, monthly, yearly, etc.)?

No, the scale of the indexing is highly flexible and user-defined.

I see latitude and longitude encoded in the mini test data - is that the only supported CRS?

No. Any indexing system could be passed to AdaSTEM framework. If user wish to use other CRS, they could transform it in prior.

I see that geopandas is a dependency - does the model class support passing a GeoDataFrame, or do you need to explicitly encode column names?

This is a mistake. geopandas is not a dependency of stemflow. I removed them from requirement, for both pypi and conda. Currently we only support pandas dataframe as training data input (there is a type check). Geopandas dataframe can be tranfomed to pd.core.frame.DataFrame beforehand.

All these issues are documented in the new documentation page Tips for spatiotemporal indexing, Tips for data types as well as listed as features in the home page (section Model and data).

Relevant PR(s): #14

from stemflow.

chenyangkang avatar chenyangkang commented on June 8, 2024
  1. Thank you for your advice on more guidance for feature data. Here comes the replies and solutions:

But it looks like the mini_data example is nearly all static features, with DOY being the only dynamic variable. Is this how other datasets are expected to be formatted? Can users provide a combination of static and dynamic covariates?

This is a great question. You are right, the datasets are formatted with static features except for DOY. This is for specific reasons. Including:

  • Model assumption
  • Scale of interest
  • Consideration for overfitting
  • Align with prediction set

However, dynamic features are absolutely suitable for training, or combination of static and dynamic ones.

The details are discussed in the additional documentation Tips for data types.

Are categorical features supported? More clarification regarding best practices for how to extract and format covariate data would help a lot.

Yes, categorical features are supported. We mentioned the feature formatting in Tips for data types.

Relevant PR(s): #14

from stemflow.

chenyangkang avatar chenyangkang commented on June 8, 2024
  1. Thank you for your advice on more detailed annotation on notebooks. I have add additional comments and annotation to all notebooks. For example, please see here.

In the intro notebook, for example, why were the parameters Spatio_blocks_count=50, Temporal_blocks_count=50 selected? what would turning these numbers up and down do?

Thank you for pointing out this specific issue. We explained the choice of spatiotemporal train-test-split parameters in AdaSTEM demo, as they approximately catch the spatiotemporal scale of bird migration.

What are the grid_len_{}_{}_threshold parameters, and what do those defaults encode?

Thank you for pointing out this confusion. To help user better understand the parameters, we add inline comments at the home page usage example and AdaSTEM demo, such as:

model = AdaSTEMRegressor(
    base_model=Hurdle(
        classifier=XGBClassifier(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1),
        regressor=XGBRegressor(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1)
    ),                                            # hurdel model for zero-inflated problem (e.g., count)
    save_gridding_plot = True,
    ensemble_fold=10,                             # data are modeled 10 times, each time with jitter and rotation in Quadtree algo
    min_ensemble_required=7,                      # Only points covered by > 7 stixels will be predicted
    grid_len_upper_threshold=25,              # force splitting if the longitudinal edge of grid exceeds 25
    grid_len_lower_threshold=5,               # stop splitting if the longitudinal edge of grid fall short 5
    temporal_start=1,                           # The next 4 params define the temporal sliding window
    temporal_end=366,                            
    temporal_step=20,
    temporal_bin_interval=50,
    points_lower_threshold=50,                    # Only stixels with more than 50 samples are trained
    Spatio1='longitude',                          # The next three params define the name of 
    Spatio2='latitude',                         # spatial coordinates shown in the dataframe
    Temporal1='DOY',
    use_temporal_to_train=True,                   # In each stixel, whether 'DOY' should be a predictor
    njobs=1
)

Tracing back to the original function is often still challenging, as there are many parameters with slightly varying names that are still hard to understand. As a user I would appreciate more narrative clarification.

I agree with you. For the first-time user, some parameter settings could be overwhelming. I hope the additional inline annotation, documentation, and notebooks could help clarify it.

Relevant PR(s): #16, #17, #18

from stemflow.

chenyangkang avatar chenyangkang commented on June 8, 2024
  1. Thank you for your comments on adding more explanation about functionality

I'm not familiar with the Hurdle modeling approach, which seems great and appears foundational to AdaSTEM - tips on best practices here would be valuable.

Definitely true. I add more information about what tasks stemflow could do, especially what is hurdle in the new document Tips for different tasks. We also mention the function repertoire at the home page.

You mentioned using other base models from sklearn in the manuscript - when would this be advantageous?

Thank you for your questions. We add additional analysis and bump chart for comparing the performance of using different base model in Base model choices. Basically the trade-off exists among "performance – speed – memory usage". stemflow is a bagging modeling framework, which makes it time-consuming and memory-consuming. We would want the best trade-off between performance and resources consumption.

You also provide great examples for comparing learning curves and optimizing strixel sizes, but these notebooks are just code blocks with some plots and little interpretation. I found myself wanting more guidance for how to interpret these results, or even guidance on why such optimization is important (do you optimize strixel size to minimize overfitting, for example?).

Thank you for your suggestions. I add additional charts, comments, and conclusion at section Optimizing stixel size for addressing why such optimization is important and how to optimize the stixels. I hope this helps users in understanding the parameter choices.

Relevant PR(s): #15, #16, #17, #18

from stemflow.

chenyangkang avatar chenyangkang commented on June 8, 2024

Still exploring the possibility of using geopandas as dependency #30

from stemflow.

chenyangkang avatar chenyangkang commented on June 8, 2024

Not using geo-features for reasons mentioned in #30

from stemflow.

Related Issues (18)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.