
Comments (4)

isms commented on August 20, 2024

Hi @aryamccarthy, thanks for opening this issue. If you have questions then others probably do too, so it's worth adding the following explanation to the docs, but to answer the questions:

  • The make data command has no relationship with storing or retrieving data from S3. It's intended to kick off the data pipeline going from raw data on the local machine to processed/cleaned data on the local machine.
  • What happens when you run make sync_data_{to,from}_s3 is defined in the Makefile here.
    • It uses the AWS command line tool ("awscli") to sync the data/ folder up to or down from your S3 bucket (there's a rough sketch of what that amounts to after this list).
    • Your S3 bucket is defined in the Makefile here, either filled in when you ran the cookiecutter or set manually after the project is created.
    • awscli is actually just a Python package and is a requirement of the project (see requirements.txt), and can be configured to store your AWS credentials in its own location, normally the .aws folder in your home folder.
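
For reference, each sync target boils down to a single awscli call. Here's a minimal sketch of what they amount to, expressed in Python via subprocess rather than the Makefile's own shell commands, with a placeholder bucket name:

```python
import subprocess

# Placeholder -- in a generated project this value comes from the BUCKET
# variable you entered when running the cookiecutter.
BUCKET = "s3://my-project-bucket"

# Roughly what `make sync_data_to_s3` does: push the local data/ folder up.
subprocess.run(["aws", "s3", "sync", "data/", f"{BUCKET}/data/"], check=True)

# Roughly what `make sync_data_from_s3` does: pull the bucket contents down.
subprocess.run(["aws", "s3", "sync", f"{BUCKET}/data/", "data/"], check=True)
```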

So far, none of these things touch .env because nowhere are secrets being accessed programmatically by Python code within the project. However, if you wanted to interact with AWS or S3 in your code, the best practice would be to define the two credential variables — typically called AWS_ACCESS_KEY_ID (non-sensitive) and AWS_SECRET_ACCESS_KEY (very sensitive) — in the .env file.

(NB: these credentials might actually be different and more limited than your system-wide awscli profile; imagine an IAM role set up so that users with the specified credentials could only access the project bucket and not, for example, your personal website bucket.)

That's where @theskumar's excellent python-dotenv project comes in: see an example usage in our docs.
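
For concreteness, a minimal sketch of that pattern, using the two variable names above and python-dotenv's find_dotenv/load_dotenv helpers:

```python
import os

from dotenv import find_dotenv, load_dotenv

# find_dotenv() walks up the directory tree until it finds a .env file, so
# this works from scripts and notebooks alike; load_dotenv() then sets the
# variables in the process environment without touching your shell.
load_dotenv(find_dotenv())

# The credentials are now available through the usual os.environ mechanism.
aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
```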

Does that all make sense? Let me know if anything still wasn't clear.


aryamccarthy commented on August 20, 2024

This is tremendously lucid, @isms. Thanks for writing back. I do see the example of those AWS values being put in the .env file, but one would need to export them, no?

Another point of clarification in the documentation would be this: The docs describe that make_dataset.py should operate on /data/interim; how does data get there? I assume it's proper to use make sync_data_from_s3 to populate /data/raw, but the steps in between are fuzzy. My team has taken to just loading data straight into interim.


isms commented on August 20, 2024
  1. You don't need to export or source any files if you're using dotenv. Since the export command sets a variable in your OS environment, you could source the .env file in your shell and then run the script in that same shell, later retrieving those variables in Python via os.environ.

    On the other hand, the dotenv tool sidesteps the OS/shell step entirely by reading those variables from your .env file directly in Python. The outcome is equivalent, except that the former requires things to happen in the shell and then in Python, while the latter happens entirely programmatically in Python.

  2. The way we think about the data folders is that all of the original data starts in raw/. As the data pipeline runs, intermediate results of some analysis interest or which may be reused in other scripts get put in interim/. Then final-ish cleaned data gets put in processed/. The make_dataset.py is just a stubbed example, nothing in there really defines that you "must" use interim -- some projects might not need it at all.

    But imagine a case where you have 100GB of raw data living in data/raw/address_history.csv like timestamp,person_id,zip_code,... and you're trying to figure out something about mobility, like how many people have moved from certain zip codes to others. Maybe the final, clean data set that gets put in processed/ for further analysis is a big, binary NxN Numpy matrix (call it M), where N is the number of unique zip codes and the count in each cell is how many people moved from the row zip code to the column zip code. What you end up with is a big, unlabeled Numpy int64 matrix with N rows and N columns but no way of figuring out how to get back from M_ij to two actual zip codes. (OK, so this is a little contrived but it represents a lot of common cases.) What we might do here is, during the data cleaning script, first get the canonical list of zip codes in proper 0-based order and save that array in data/interim/canonical_zip_codes.csv so it can be used in other parts of the data processing pipeline, like model training or maybe later ad hoc analysis in a notebook. (There's a rough sketch of this after this list.)

    Ideally, the pipeline is deterministic, so any other team member should be able to start with nothing but the stuff in data/raw/ and get the rest by running the project. However, with larger data sets you sometimes want to cheat and sync up whole folders, or just move your work in progress, e.g. from your laptop to a beefy EC2 instance, in which case you might run make sync_data_to_s3.
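
Here's a rough sketch of that zip-code example, under the assumptions above (columns timestamp, person_id, zip_code in the raw CSV; the processed filename is made up):

```python
# Sketch of the contrived mobility example: raw -> interim -> processed.
import numpy as np
import pandas as pd

raw = pd.read_csv("data/raw/address_history.csv", dtype={"zip_code": str})

# Save the canonical, 0-indexed list of zip codes to interim/ so other
# pipeline steps (model training, ad hoc notebooks) can map M's rows and
# columns back to real zip codes.
zips = sorted(raw["zip_code"].unique())
pd.Series(zips, name="zip_code").to_csv(
    "data/interim/canonical_zip_codes.csv", index_label="index"
)

# Build the N x N move-count matrix M: row = origin zip, column = destination zip.
index = {z: i for i, z in enumerate(zips)}
M = np.zeros((len(zips), len(zips)), dtype=np.int64)
raw = raw.sort_values(["person_id", "timestamp"])
for person_id, history in raw.groupby("person_id"):
    codes = history["zip_code"].tolist()
    for origin, destination in zip(codes, codes[1:]):
        if origin != destination:
            M[index[origin], index[destination]] += 1

# The "final-ish" cleaned artifact goes in processed/.
np.save("data/processed/zip_code_moves.npy", M)
```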


pjbull commented on August 20, 2024

This seems resolved. Closing.

