Comments (6)
Hi Chris,
Thank you for your valuable comments. I agree with most of the issues you raised. Here are my replies and the corresponding solutions:
- About how time and space are encoded:
Do input dataframes require a DOY column?
No. The spatial and temporal indexing columns are flexible and can be passed as arguments.
Does anything change if data are provided at other temporal scales (weekly, monthly, yearly, etc.)?
No, the scale of the indexing is highly flexible and user-defined.
I see latitude and longitude encoded in the mini test data - is that the only supported CRS?
No. Any indexing system can be passed to the AdaSTEM framework. If users wish to use another CRS, they can transform their data beforehand.
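To illustrate (the column names `x_km`/`y_km` and the equirectangular projection are invented for this sketch, not stemflow API), a user could reproject longitude/latitude to a planar system in pandas before fitting, then point the spatial-indexing arguments at the new columns:

```python
import numpy as np
import pandas as pd

# Hypothetical observations indexed by longitude/latitude (WGS84)
df = pd.DataFrame({
    "longitude": [-83.0, -82.5],
    "latitude": [42.0, 42.5],
    "DOY": [120, 121],
})

# Reproject to an approximate planar system in kilometers before passing
# the dataframe to the model; a simple equirectangular approximation
# around the mean latitude is used here for brevity.
R = 6371.0  # Earth radius in km
lat0 = np.deg2rad(df["latitude"].mean())
df["x_km"] = R * np.deg2rad(df["longitude"]) * np.cos(lat0)
df["y_km"] = R * np.deg2rad(df["latitude"])
# The new columns can then be used as the spatial indexing columns.
```

For production use, a proper CRS transformation (e.g., with pyproj) would replace the toy approximation above.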
I see that geopandas is a dependency - does the model class support passing a GeoDataFrame, or do you need to explicitly encode column names?
This was a mistake: geopandas is not a dependency of stemflow. I removed it from the requirements for both PyPI and conda. Currently we only support a pandas DataFrame as training data input (there is a type check). A geopandas GeoDataFrame can be transformed to pd.core.frame.DataFrame beforehand.
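As a hedged sketch of that conversion (plain `(x, y)` tuples stand in for shapely Point geometries so the example runs without geopandas installed):

```python
import pandas as pd

# Stand-in for a GeoDataFrame: plain (x, y) tuples play the role of
# shapely Point geometries in this sketch.
gdf_like = pd.DataFrame({
    "geometry": [(-83.0, 42.0), (-82.5, 42.5)],
    "count": [3, 0],
})

# Flatten the geometry into plain numeric columns and drop it, leaving
# an ordinary pandas DataFrame that passes stemflow's type check.
df = gdf_like.assign(
    longitude=[pt[0] for pt in gdf_like["geometry"]],
    latitude=[pt[1] for pt in gdf_like["geometry"]],
).drop(columns="geometry")
```

With a real GeoDataFrame the same idea is extracting `gdf.geometry.x` / `gdf.geometry.y` into columns and then dropping the `geometry` column.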
All these points are documented in the new documentation pages Tips for spatiotemporal indexing and Tips for data types, and are also listed as features on the home page (section Model and data).
Relevant PR(s): #14
from stemflow.
- Thank you for your advice on providing more guidance for feature data. Here are the replies and solutions:
But it looks like the mini_data example is nearly all static features, with DOY being the only dynamic variable. Is this how other datasets are expected to be formatted? Can users provide a combination of static and dynamic covariates?
This is a great question. You are right: the datasets are formatted with static features except for DOY. This is for specific reasons, including:
- Model assumptions
- Scale of interest
- Overfitting considerations
- Alignment with the prediction set
However, dynamic features, or a combination of static and dynamic features, are fully supported for training.
The details are discussed in the additional documentation Tips for data types.
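As a small illustration of mixing the two kinds of covariates (the site ids and the `elevation`/`ndvi` columns are invented for this sketch), static per-site features can be joined onto dynamic per-DOY rows with a plain pandas merge:

```python
import pandas as pd

# Static, per-site covariates (e.g., elevation)
static = pd.DataFrame({
    "site": ["A", "B"],
    "elevation": [120.0, 300.0],
})

# Dynamic covariates varying by site and DOY (e.g., a vegetation index)
dynamic = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "DOY": [100, 150, 100, 150],
    "ndvi": [0.2, 0.6, 0.3, 0.7],
})

# One training row per (site, DOY): static features are repeated across
# DOY, while dynamic features change with it.
train = dynamic.merge(static, on="site", how="left")
```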
Are categorical features supported? More clarification regarding best practices for how to extract and format covariate data would help a lot.
Yes, categorical features are supported. Feature formatting is covered in Tips for data types.
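One common way to format a categorical covariate, sketched here with an invented `land_cover` column, is one-hot encoding with pandas before training:

```python
import pandas as pd

df = pd.DataFrame({
    "DOY": [100, 150, 200],
    "land_cover": ["forest", "urban", "forest"],  # categorical covariate
})

# One-hot encode the categorical column so every feature column is
# numeric before it reaches the base model.
df_encoded = pd.get_dummies(df, columns=["land_cover"], dtype=int)
```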
Relevant PR(s): #14
- Thank you for your advice on adding more detailed annotations to the notebooks. I have added comments and annotations to all notebooks. For example, please see here.
In the intro notebook, for example, why were the parameters Spatio_blocks_count=50, Temporal_blocks_count=50 selected? what would turning these numbers up and down do?
Thank you for pointing out this specific issue. We explained the choice of the spatiotemporal train-test-split parameters in the AdaSTEM demo; they approximately match the spatiotemporal scale of bird migration.
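For intuition, here is a simplified sketch of the idea behind a spatiotemporal block split (not stemflow's actual implementation): cut each axis into blocks and hold out whole blocks rather than individual rows, so held-out points are spatiotemporally separated from training points. Coarser blocks reduce autocorrelation leakage; very fine blocks approach a random split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, 1000),
    "latitude": rng.uniform(-90, 90, 1000),
    "DOY": rng.integers(1, 367, 1000),
})

# Bin each axis into equal-width blocks and combine into one block id
spatio_blocks, temporal_blocks = 50, 50
df["block"] = (
    pd.cut(df["longitude"], spatio_blocks, labels=False).astype(str) + "_"
    + pd.cut(df["latitude"], spatio_blocks, labels=False).astype(str) + "_"
    + pd.cut(df["DOY"], temporal_blocks, labels=False).astype(str)
)

# Hold out 30% of blocks (not rows) as the test set
blocks = df["block"].unique()
test_blocks = set(rng.choice(blocks, size=int(0.3 * len(blocks)), replace=False))
test = df[df["block"].isin(test_blocks)]
train = df[~df["block"].isin(test_blocks)]
```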
What are the grid_len_{}_{}_threshold parameters, and what do those defaults encode?
Thank you for pointing out this confusion. To help users better understand the parameters, we added inline comments to the home page usage example and the AdaSTEM demo, such as:
```python
model = AdaSTEMRegressor(
    base_model=Hurdle(
        classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
        regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
    ),                            # hurdle model for zero-inflated problems (e.g., counts)
    save_gridding_plot=True,
    ensemble_fold=10,             # data are modeled 10 times, each time with jitter and rotation in the Quadtree algorithm
    min_ensemble_required=7,      # only points covered by >7 stixels are predicted
    grid_len_upper_threshold=25,  # force splitting if the longitudinal edge of a grid exceeds 25
    grid_len_lower_threshold=5,   # stop splitting if the longitudinal edge of a grid falls below 5
    temporal_start=1,             # the next 4 params define the temporal sliding window
    temporal_end=366,
    temporal_step=20,
    temporal_bin_interval=50,
    points_lower_threshold=50,    # only stixels with more than 50 samples are trained
    Spatio1='longitude',          # the next three params give the names of the
    Spatio2='latitude',           # spatial and temporal coordinate columns
    Temporal1='DOY',              # in the dataframe
    use_temporal_to_train=True,   # whether 'DOY' should be a predictor within each stixel
    njobs=1
)
```
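As a hedged reading of the four temporal parameters in the example above (my interpretation of how they interact, not stemflow internals), they define overlapping sliding windows along the DOY axis:

```python
# Windows start at temporal_start, advance by temporal_step, and each
# spans temporal_bin_interval days.
temporal_start, temporal_end = 1, 366
temporal_step, temporal_bin_interval = 20, 50

windows = [
    (start, start + temporal_bin_interval)
    for start in range(temporal_start, temporal_end, temporal_step)
]
# First windows: (1, 51), (21, 71), (41, 91), ... each overlapping the
# previous one by 30 days.
```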
Tracing back to the original function is often still challenging, as there are many parameters with slightly varying names that are still hard to understand. As a user I would appreciate more narrative clarification.
I agree. For first-time users, some parameter settings can be overwhelming. I hope the additional inline annotations, documentation, and notebooks help clarify them.
- Thank you for your comments on adding more explanation about functionality.
I'm not familiar with the Hurdle modeling approach, which seems great and appears foundational to AdaSTEM - tips on best practices here would be valuable.
Definitely true. I added more information about what tasks stemflow can handle, and especially what a hurdle model is, in the new document Tips for different tasks. We also describe the function repertoire on the home page.
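For readers unfamiliar with the approach, here is a minimal from-scratch hurdle sketch using scikit-learn (a toy illustration, not stemflow's `Hurdle` class): a classifier decides zero vs. non-zero occurrence, and a regressor fitted only on the non-zero rows supplies the magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy zero-inflated counts: many exact zeros, positive values otherwise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, np.round(np.exp(X[:, 1]) + 1), 0.0)

# Stage 1: classify occurrence (zero vs. non-zero)
clf = LogisticRegression().fit(X, (y > 0).astype(int))
# Stage 2: regress abundance on the non-zero subset only
reg = LinearRegression().fit(X[y > 0], y[y > 0])

# Prediction: the classifier gates the regressor's output, so predicted
# non-occurrence yields exactly zero.
pred = clf.predict(X) * reg.predict(X)
```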
You mentioned using other base models from sklearn in the manuscript - when would this be advantageous?
Thank you for your questions. We added an additional analysis and a bump chart comparing the performance of different base models in Base model choices. Basically, there is a three-way trade-off among performance, speed, and memory usage. stemflow is a bagging modeling framework, which makes it both time- and memory-consuming, so we want the best trade-off between performance and resource consumption.
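As a sketch of how one might weigh that trade-off outside stemflow (toy data; the two candidate models are illustrative), fit time and serialized size can be measured for each candidate base model, then compared against accuracy:

```python
import pickle
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = 2 * X[:, 0] + rng.normal(size=2000)

# Measure two axes of the trade-off for each candidate base model:
# fit time (speed) and pickled size (memory); accuracy would be the third.
results = {}
for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    t0 = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - t0
    size_kb = len(pickle.dumps(model)) / 1024
    results[type(model).__name__] = (elapsed, size_kb)
```

Because every stixel trains its own copy of the base model, these per-model costs multiply by the number of stixels and ensemble folds.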
You also provide great examples for comparing learning curves and optimizing strixel sizes, but these notebooks are just code blocks with some plots and little interpretation. I found myself wanting more guidance for how to interpret these results, or even guidance on why such optimization is important (do you optimize strixel size to minimize overfitting, for example?).
Thank you for your suggestions. I added charts, comments, and a conclusion to the section Optimizing stixel size, addressing why such optimization is important and how to optimize the stixels. I hope this helps users understand the parameter choices.
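For intuition on why stixel size matters (a toy sketch with simulated points, not stemflow's adaptive gridding), one can count how many grid cells would survive a `points_lower_threshold` of 50 at different grid lengths: smaller stixels resolve finer spatial patterns, but more of them fall below the training threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "longitude": rng.uniform(-30, 30, 5000),
    "latitude": rng.uniform(-30, 30, 5000),
})

# For each candidate grid length, compute the fraction of grid cells
# holding at least points_lower_threshold samples (i.e., trainable cells).
trainable_fraction = {}
for grid_len in (5, 25):
    cells = (
        (df["longitude"] // grid_len).astype(int).astype(str) + "_"
        + (df["latitude"] // grid_len).astype(int).astype(str)
    )
    counts = cells.value_counts()
    trainable_fraction[grid_len] = float((counts >= 50).mean())
```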
Relevant PR(s): #15, #16, #17, #18
Still exploring the possibility of using geopandas as a dependency: see #30.
Not using geo-features, for the reasons mentioned in #30.