Comments (6)
Hi Chris,
Thank you for your valuable comments. I agree with most of the issues you raised. Here are my replies and the corresponding solutions:
- About how time and space are encoded:
Do input dataframes require a DOY column?
No. The spatial and temporal indexing columns are flexible and can be passed as arguments.
Does anything change if data are provided at other temporal scales (weekly, monthly, yearly, etc.)?
No, the scale of the indexing is highly flexible and user-defined.
I see latitude and longitude encoded in the mini test data - is that the only supported CRS?
No. Any indexing system can be passed to the AdaSTEM framework. If users wish to use another CRS, they can transform their data beforehand.
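To illustrate (the column names `x_km`/`y_km` and the equirectangular projection are invented for this sketch, not stemflow API), a user could reproject longitude/latitude to a planar system in pandas before fitting, then point the spatial-indexing arguments at the new columns:

```python
import numpy as np
import pandas as pd

# Hypothetical observations indexed by longitude/latitude (WGS84)
df = pd.DataFrame({
    "longitude": [-83.0, -82.5],
    "latitude": [42.0, 42.5],
    "DOY": [120, 121],
})

# Reproject to an approximate planar system in kilometers before passing
# the dataframe to the model; a simple equirectangular approximation
# around the mean latitude is used here for brevity.
R = 6371.0  # Earth radius in km
lat0 = np.deg2rad(df["latitude"].mean())
df["x_km"] = R * np.deg2rad(df["longitude"]) * np.cos(lat0)
df["y_km"] = R * np.deg2rad(df["latitude"])
# The new columns can then be used as the spatial indexing columns.
```

For production use, a proper CRS transformation (e.g., with pyproj) would replace the toy approximation above.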
I see that geopandas is a dependency - does the model class support passing a GeoDataFrame, or do you need to explicitly encode column names?
This was a mistake: geopandas is not a dependency of stemflow. I removed it from the requirements for both PyPI and conda. Currently we only support a pandas DataFrame as training data input (there is a type check). A geopandas GeoDataFrame can be transformed to pd.core.frame.DataFrame beforehand.
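As a hedged sketch of that conversion (plain `(x, y)` tuples stand in for shapely Point geometries so the example runs without geopandas installed):

```python
import pandas as pd

# Stand-in for a GeoDataFrame: plain (x, y) tuples play the role of
# shapely Point geometries in this sketch.
gdf_like = pd.DataFrame({
    "geometry": [(-83.0, 42.0), (-82.5, 42.5)],
    "count": [3, 0],
})

# Flatten the geometry into plain numeric columns and drop it, leaving
# an ordinary pandas DataFrame that passes stemflow's type check.
df = gdf_like.assign(
    longitude=[pt[0] for pt in gdf_like["geometry"]],
    latitude=[pt[1] for pt in gdf_like["geometry"]],
).drop(columns="geometry")
```

With a real GeoDataFrame the same idea is extracting `gdf.geometry.x` / `gdf.geometry.y` into columns and then dropping the `geometry` column.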
All these points are documented in the new documentation pages Tips for spatiotemporal indexing and Tips for data types, and are also listed as features on the home page (section Model and data).
Relevant PR(s): #14
from stemflow.
- Thank you for your advice on providing more guidance for feature data. Here are the replies and solutions:
But it looks like the mini_data example is nearly all static features, with DOY being the only dynamic variable. Is this how other datasets are expected to be formatted? Can users provide a combination of static and dynamic covariates?
This is a great question. You are right: the datasets are formatted with static features except for DOY. This is for specific reasons, including:
- Model assumptions
- Scale of interest
- Overfitting considerations
- Alignment with the prediction set
However, dynamic features, or a combination of static and dynamic features, are fully supported for training.
The details are discussed in the additional documentation Tips for data types.
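As a small illustration of mixing the two kinds of covariates (the site ids and the `elevation`/`ndvi` columns are invented for this sketch), static per-site features can be joined onto dynamic per-DOY rows with a plain pandas merge:

```python
import pandas as pd

# Static, per-site covariates (e.g., elevation)
static = pd.DataFrame({
    "site": ["A", "B"],
    "elevation": [120.0, 300.0],
})

# Dynamic covariates varying by site and DOY (e.g., a vegetation index)
dynamic = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "DOY": [100, 150, 100, 150],
    "ndvi": [0.2, 0.6, 0.3, 0.7],
})

# One training row per (site, DOY): static features are repeated across
# DOY, while dynamic features change with it.
train = dynamic.merge(static, on="site", how="left")
```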
Are categorical features supported? More clarification regarding best practices for how to extract and format covariate data would help a lot.
Yes, categorical features are supported. Feature formatting is covered in Tips for data types.
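One common way to format a categorical covariate, sketched here with an invented `land_cover` column, is one-hot encoding with pandas before training:

```python
import pandas as pd

df = pd.DataFrame({
    "DOY": [100, 150, 200],
    "land_cover": ["forest", "urban", "forest"],  # categorical covariate
})

# One-hot encode the categorical column so every feature column is
# numeric before it reaches the base model.
df_encoded = pd.get_dummies(df, columns=["land_cover"], dtype=int)
```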
Relevant PR(s): #14
- Thank you for your advice on adding more detailed annotations to the notebooks. I have added comments and annotations to all notebooks. For example, please see here.
In the intro notebook, for example, why were the parameters Spatio_blocks_count=50, Temporal_blocks_count=50 selected? what would turning these numbers up and down do?
Thank you for pointing out this specific issue. We explained the choice of the spatiotemporal train-test-split parameters in the AdaSTEM demo; they approximately match the spatiotemporal scale of bird migration.
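For intuition, here is a simplified sketch of the idea behind a spatiotemporal block split (not stemflow's actual implementation): cut each axis into blocks and hold out whole blocks rather than individual rows, so held-out points are spatiotemporally separated from training points. Coarser blocks reduce autocorrelation leakage; very fine blocks approach a random split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, 1000),
    "latitude": rng.uniform(-90, 90, 1000),
    "DOY": rng.integers(1, 367, 1000),
})

# Bin each axis into equal-width blocks and combine into one block id
spatio_blocks, temporal_blocks = 50, 50
df["block"] = (
    pd.cut(df["longitude"], spatio_blocks, labels=False).astype(str) + "_"
    + pd.cut(df["latitude"], spatio_blocks, labels=False).astype(str) + "_"
    + pd.cut(df["DOY"], temporal_blocks, labels=False).astype(str)
)

# Hold out 30% of blocks (not rows) as the test set
blocks = df["block"].unique()
test_blocks = set(rng.choice(blocks, size=int(0.3 * len(blocks)), replace=False))
test = df[df["block"].isin(test_blocks)]
train = df[~df["block"].isin(test_blocks)]
```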
What are the grid_len_{}_{}_threshold parameters, and what do those defaults encode?
Thank you for pointing out this confusion. To help users better understand the parameters, we added inline comments to the home page usage example and the AdaSTEM demo, such as:
```python
model = AdaSTEMRegressor(
    base_model=Hurdle(
        classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
        regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
    ),                            # hurdle model for zero-inflated problems (e.g., counts)
    save_gridding_plot=True,
    ensemble_fold=10,             # data are modeled 10 times, each time with jitter and rotation in the Quadtree algorithm
    min_ensemble_required=7,      # only points covered by >7 stixels are predicted
    grid_len_upper_threshold=25,  # force splitting if the longitudinal edge of a grid exceeds 25
    grid_len_lower_threshold=5,   # stop splitting if the longitudinal edge of a grid falls below 5
    temporal_start=1,             # the next 4 params define the temporal sliding window
    temporal_end=366,
    temporal_step=20,
    temporal_bin_interval=50,
    points_lower_threshold=50,    # only stixels with more than 50 samples are trained
    Spatio1='longitude',          # the next three params give the names of the
    Spatio2='latitude',           # spatial and temporal coordinate columns
    Temporal1='DOY',              # in the dataframe
    use_temporal_to_train=True,   # whether 'DOY' should be a predictor within each stixel
    njobs=1
)
```
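As a hedged reading of the four temporal parameters in the example above (my interpretation of how they interact, not stemflow internals), they define overlapping sliding windows along the DOY axis:

```python
# Windows start at temporal_start, advance by temporal_step, and each
# spans temporal_bin_interval days.
temporal_start, temporal_end = 1, 366
temporal_step, temporal_bin_interval = 20, 50

windows = [
    (start, start + temporal_bin_interval)
    for start in range(temporal_start, temporal_end, temporal_step)
]
# First windows: (1, 51), (21, 71), (41, 91), ... each overlapping the
# previous one by 30 days.
```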
Tracing back to the original function is often still challenging, as there are many parameters with slightly varying names that are still hard to understand. As a user I would appreciate more narrative clarification.
I agree. For first-time users, some parameter settings can be overwhelming. I hope the additional inline annotations, documentation, and notebooks help clarify them.
- Thank you for your comments on adding more explanation about functionality.
I'm not familiar with the Hurdle modeling approach, which seems great and appears foundational to AdaSTEM - tips on best practices here would be valuable.
Definitely true. I added more information about what tasks stemflow can handle, and especially what a hurdle model is, in the new document Tips for different tasks. We also describe the function repertoire on the home page.
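For readers unfamiliar with the approach, here is a minimal from-scratch hurdle sketch using scikit-learn (a toy illustration, not stemflow's `Hurdle` class): a classifier decides zero vs. non-zero occurrence, and a regressor fitted only on the non-zero rows supplies the magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy zero-inflated counts: many exact zeros, positive values otherwise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, np.round(np.exp(X[:, 1]) + 1), 0.0)

# Stage 1: classify occurrence (zero vs. non-zero)
clf = LogisticRegression().fit(X, (y > 0).astype(int))
# Stage 2: regress abundance on the non-zero subset only
reg = LinearRegression().fit(X[y > 0], y[y > 0])

# Prediction: the classifier gates the regressor's output, so predicted
# non-occurrence yields exactly zero.
pred = clf.predict(X) * reg.predict(X)
```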
You mentioned using other base models from sklearn in the manuscript - when would this be advantageous?
Thank you for your questions. We added an additional analysis and a bump chart comparing the performance of different base models in Base model choices. Basically, there is a three-way trade-off among performance, speed, and memory usage. stemflow is a bagging modeling framework, which makes it both time- and memory-consuming, so we want the best trade-off between performance and resource consumption.
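As a sketch of how one might weigh that trade-off outside stemflow (toy data; the two candidate models are illustrative), fit time and serialized size can be measured for each candidate base model, then compared against accuracy:

```python
import pickle
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = 2 * X[:, 0] + rng.normal(size=2000)

# Measure two axes of the trade-off for each candidate base model:
# fit time (speed) and pickled size (memory); accuracy would be the third.
results = {}
for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    t0 = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - t0
    size_kb = len(pickle.dumps(model)) / 1024
    results[type(model).__name__] = (elapsed, size_kb)
```

Because every stixel trains its own copy of the base model, these per-model costs multiply by the number of stixels and ensemble folds.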
You also provide great examples for comparing learning curves and optimizing strixel sizes, but these notebooks are just code blocks with some plots and little interpretation. I found myself wanting more guidance for how to interpret these results, or even guidance on why such optimization is important (do you optimize strixel size to minimize overfitting, for example?).
Thank you for your suggestions. I added charts, comments, and a conclusion to the section Optimizing stixel size, addressing why such optimization is important and how to optimize the stixels. I hope this helps users understand the parameter choices.
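For intuition on why stixel size matters (a toy sketch with simulated points, not stemflow's adaptive gridding), one can count how many grid cells would survive a `points_lower_threshold` of 50 at different grid lengths: smaller stixels resolve finer spatial patterns, but more of them fall below the training threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "longitude": rng.uniform(-30, 30, 5000),
    "latitude": rng.uniform(-30, 30, 5000),
})

# For each candidate grid length, compute the fraction of grid cells
# holding at least points_lower_threshold samples (i.e., trainable cells).
trainable_fraction = {}
for grid_len in (5, 25):
    cells = (
        (df["longitude"] // grid_len).astype(int).astype(str) + "_"
        + (df["latitude"] // grid_len).astype(int).astype(str)
    )
    counts = cells.value_counts()
    trainable_fraction[grid_len] = float((counts >= 50).mean())
```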
Relevant PR(s): #15, #16, #17, #18
Still exploring the possibility of using geopandas as a dependency: see #30.
Not using geo-features, for the reasons mentioned in #30.