This sample shows how to generate a CSV dataset of 10 million rows, chunked into 10 files of 1 million rows each, and ingest it into an online feature group in Amazon SageMaker Feature Store using SageMaker Processing jobs. A SageMaker notebook instance of type ml.c5.4xlarge was used to run the notebooks in this repo.
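The chunking part of the generation step can be sketched roughly as follows (function and file names are illustrative, not the repo's actual code; the real notebooks replicate rows from seed.csv before chunking and then upload the chunks to S3):

```python
import math
import pandas as pd

def chunk_and_save(df: pd.DataFrame, n_chunks: int, prefix: str) -> list:
    """Split df into n_chunks pieces and write each piece as its own CSV file."""
    chunk_size = math.ceil(len(df) / n_chunks)
    paths = []
    for i in range(n_chunks):
        chunk = df.iloc[i * chunk_size:(i + 1) * chunk_size]
        path = f"{prefix}_part{i:02d}.csv"
        chunk.to_csv(path, index=False)
        paths.append(path)
    return paths

# Each resulting file could then be uploaded to S3, e.g.:
# boto3.client("s3").upload_file(path, "my-bucket", f"data/{path}")
```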
Please make sure the role attached to the notebook instance has the policies below. For this sample, the role has:
- AmazonSageMakerFullAccess
- AmazonSageMakerFeatureStoreAccess
- AmazonS3FullAccess (limit the permissions to specific buckets)
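A scoped-down alternative to AmazonS3FullAccess might look like the following (the bucket name is a placeholder for your own bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-feature-store-bucket",
        "arn:aws:s3:::my-feature-store-bucket/*"
      ]
    }
  ]
}
```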
The sample walks through the following steps:
- Data Generation - generate sample data from seed.csv, chunk it, and save the chunks to S3
- Feature Group Create/Provision - create the feature group used for ingestion
- Ingest using the Ingest API in the SageMaker Python SDK
- Ingest using the PutRecord API in PySpark
- Validation - read a small sample of the dataset back from the feature store to validate it