gentzkow / template_archive Goto Github PK

Python 16.66% R 3.20% Stata 4.39% TeX 74.20% Shell 0.22% Julia 1.32%

template_archive's Introduction

README

Requirements
Setup
Running Package Scripts in Other Languages
Adding Packages
Command Line Usage
User Configuration
Windows Differences
License

Requirements

Note: The application requirements and setup instructions outlined below are intended to serve general users. To build the repository as-is, the following applications are required:

R
Stata
Python
git
git lfs
LyX
A TeX distribution for your local OS (for example, MacTeX for MacOS).

You may download the latest versions of each. By default, the Setup instructions below will assume their usage. Note that some of these applications must also be invocable from the command line. See the Command Line Usage section for details on how to set this up. Note that if you wish to run Julia scripts in your repository, you will additionally need to install Julia and set up its command line usage. Julia is currently not required to build the repository as-is. If you are planning to use a conda environment for development (see instructions below), you are not required to have local installations or enable command line usage of Stata, R, Python, or Julia (although this is recommended).

You must set up a personal GitHub account to clone private repositories on which you are a collaborator. For public repositories (such as template), Git will suffice. You may need to set up Homebrew if git and git-lfs are not available on your local computer.

If you are using MacOS, ensure your terminal is operating in bash rather than the default zsh. MacOS users who are running template on an Apple Silicon chip will instead want to use Rosetta as their default terminal. You can find instructions on how to shift from zsh to Rosetta here and here.

WindowsOS users (with Version 10 or higher) will need to switch to bash from PowerShell. To do this, you can run bash from within a PowerShell terminal (you must have installed git first).

Once you have met these OS and application requirements, clone a team repository from GitHub and proceed to Setup.

Setup

Create a config_user.yaml file in the root directory. An example can be found in the /setup directory. If this step is skipped, the default config_user.yaml will be copied over when running check_setup.py below. You might skip this step if you do not want not to specify any external paths, or want to use default executable names. See the User Configuration section below for further details.
Initialize git lfs. From the root of the repository, run:

   git lfs install
   ./setup/lfs_setup.sh
   git lfs pull

This will not affect files that ship with the template (which use the standard git storage). The first command will initialize git lfs for usage. The second command will instruct git lfs to handle files with extensions such as .pdf, .png, etc. The third command will download large files from the remote repository to your local computer, if any exist. See here for more on how to modify your git lfs settings.

Note that it is not required to initialize git lfs to work with the files hosted on template, but it is highly recommended that you initialize git lfs for large file storage by running the script above.

If you already have conda setup on your local machine, feel free to skip this step. If not, this will install a lightweight version of conda that will not interfere with your local Python and R installations.

NOTE: If you do not wish to install conda, proceed to steps 6 - 8 (installing conda is recommended).

Install miniconda to be used to manage the R/Python virtual environment, if you have not already done this. If you have homebrew (which can be download here) miniconda can be installed as follows:

    brew install --cask miniconda

Once you have installed conda, you need to initialize conda by running the following commands and restarting your terminal:

    conda config --set auto_activate_base false
    conda init $(echo $0 | cut -d'-' -f 2)

Next, create a conda environment with the commands:

    conda config --set channel_priority strict
    conda env create -f setup/conda_env.yaml

By default, we recommend users to run conda config --set channel_priority strict to speed up the environment build time. In a "strict" channel priority, packages in lower priority channels are not considered if a package with the same name appears in a higher priority channel. However, if there are package version conflicts when building the environment, consider removing this condition by running conda config --set channel_priority flexible. See the conda User Guide for more information.

The default name for the conda environment is template. This can be changed by editing the first line of /setup/conda_env.yaml. To activate the conda virtual environment, run:

    conda activate <project_name>

The conda environment should be active throughout setup, and whenever executing modules within the project in the future. You can deactivate the environment with:

conda deactivate <project_name>

Please ensure that your conda installation is up to date before proceeding. If you experience issues building your conda environment, check the version of your conda installation and update it if needed by running:

conda -V
conda update -n base -c defaults conda

Then, proceed to rebuild the environment.

Fetch gslab_make submodule files. We use a Git submodule to track our gslab_make dependency in the /lib/gslab_make folder. After cloning the repository, you will need to initialize and fetch files for the gslab_make submodule. One way to do this is to run the following bash commands from the root of the repository:

   git submodule init
   git submodule update

Once these commands have run to completion, the /lib/gslab_make folder should be populated with gslab_make. For users with miniconda, proceed to step 7.

For users who do not want to install miniconda, follow the instructions in /setup/dependencies.md to manually download all required dependencies. Ensure you download the correct versions of these packages. Proceed to step 7.
Run the script /setup/check_setup.py. One way to do this is to run the following bash command from the /setup directory (note that you must be in the /setup directory, and you must have local installations of the softwares documented in Requirements. for the script to run successfully):

   python check_setup.py

To build the repository, run the following bash command from the root of repository:
```
python run_all.py
```

Adding Packages

Note: These instructions are relevant for users who have installed miniconda. If you have not done so, consult /setup/dependencies.md.

Python

Add any required packages to /setup/conda_env.yaml. If possible add the package version number as well. If there is a package that is not available from conda, add this to the pip section of the yaml file. In order to not re-run the entire environment setup you can download these individual files from conda with the command:

conda install -c conda-forge --name <environment name> <package_name=version_number>

R

Add any required packages that are available via CRAN to /setup/conda_env.yaml. These must be prepended with r-. If there is a package that is only available from GitHub and not from CRAN, add this package to /setup/setup_r.r (after copying this script from /extensions). These individual packages can be added in the same way as Python packages above (with the r- prepend). Note that you may need to install the latest version of conda as outlined in the setup instructions above to properly load packages.

Stata

Install Stata dependencies using /setup/download_stata_ado.do (copy download_stata_ado.do from /extensions to /setup first). We keep all non-base Stata ado files in the lib subdirectory, so most non-base Stata ado files will be versioned. To add additional Stata dependencies, use the following bash command from the setup subdirectory:

stata-mp -e download_stata_ado.do

Julia

First, add any required Julia packages to julia_conda_env.jl. Follow the same steps described in Setup to build and activate your conda environment, being sure to uncomment the line referencing julia in /setup/conda_env.yaml before building the environment. Once the environment is activated, run the following line from the /setup directory:

julia julia_conda_env.jl

Then, ensure any Julia scripts are properly referenced in the relevant make.py scripts with the prefix gs.run_julia, and proceed to run run_all.py.

Command Line Usage

For instructions on how to set up command line usage, refer to the repo wiki.

By default, the repository assumes these executable names for the following applications:

application : executable

python      : python
git-lfs     : git-lfs
lyx         : lyx
r           : Rscript
stata       : stata-mp (this will need to be updated if using a version of Stata that is not Stata-MP)
julia       : julia

Default executable names can be updated in config_user.yaml. For further details, see the User Configuration section.

User Configuration

config_user.yaml contains settings and metadata such as local paths that are specific to an individual user and should not be committed to Git. For this repository, this includes local paths to external dependencies as well as executable names for locally installed software.

Required applications may be set up for command line usage on your computer with a different executable name from the default. If so, specify the correct executable name in config_user.yaml. This configuration step is explained further in the repo wiki.

Running Package Scripts in Other Languages

By default, this template is set up to run Python scripts. The template is, however, capable of running scripts in other languages too (make-scripts are always in Python, but module scripts called by make-scripts can be in other languages).

The directory /extensions includes the code necessary to run the repo with R and Stata scripts. Only code that differs from the default implementation is included. For example, to run the repo using Stata scripts, the following steps need to be taken.

Replace /analysis/make.py with /extensions/stata/analysis/make.py and /data/make.py with /extensions/stata/data/make.py.
Copy contents of /extensions/stata/analysis/code to /analysis/code and contents of /extensions/stata/data/code to /data/code.
Copy .ado dependencies from /extensions/stata/lib/stata to /lib/stata. Included are utilities from the repo gslab_stata.
Copy setup script from /extensions/stata/setup to /setup.

Windows Differences

The instructions in template are applicable to Linux and Mac users. However, with just a few tweaks, this repo can also work on Windows.

If you are using Windows, you may need to run certain bash commands in administrator mode due to permission errors. To do so, open your terminal by right clicking and selecting Run as administrator. To set administrator mode on permanently, refer to the repo wiki.

The executable names are likely to differ on your computer if you are using Windows. Executable names for Windows generally resemble:

application : executable
python      : python
git-lfs     : git-lfs
lyx         : LyX#.# (where #.# refers to the version number)
r           : Rscript
stata       : StataMP-64 (will need to be updated if using a version of Stata that is not Stata-MP or 64-bit)
julia       : julia

To download additional ado files on Windows, you will likely have to adjust this bash command:

stata_executable -e download_stata_ado.do

stata_executable refers to the name of your Stata executable. For example, if your Stata executable was located in C:\Program Files\Stata15\StataMP-64.exe, you would want to use the following bash command:

StataMP-64 -e download_stata_ado.do

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

template_archive's People

Contributors

Stargazers

Watchers

template_archive's Issues

Fix issue with lfs files in child repos

After creating a new repo using the template, we noticed the following error upon cloning:

Error downloading object: analysis/output/regression.csv (b98f379): 
Smudge error: Error downloading analysis/output/regression.csv
(b98f379011017bf4300cbf1f96c0be913478cd5794b3e5622fc499cf3be2430e): [b98f379011017bf4300cbf1f96c0be913478cd5794b3e5622fc499cf3be2430e] Object does not exist on the server: 
[404] Object does not exist on the server

Errors logged to /Users/zahedian/Documents/race-in-notes/.git/lfs/logs/20220224T162457.941482.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: analysis/output/regression.csv: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

From this thread online, we determined that the cause of the issue is cloning files stored in the template with git-lfs. These files don't properly transfer to a repo created from the template.

In this issue, we seek to correct this issue with repos created from gentzkow/template.

Incorporate updated gslab_make into template

Implementation of the following tasks (see discussion here):

Summary:

Develop setup tools (i.e., setup.py)
Create template folder structure and run.py script at the module level
Create template master run.py script at the repository level
Write README.md
Develop replicability tools

Update RA manual and standardize language across documentation

Update RA manual to incorporate from procedures from latest version of gslab_make
Standardize language across documentation

Revise run_all.py

Make sure the build script runs correctly after following the setup process. snairdesai and I both got the same error after running run_all.py on the terminal. The text of the error on the terminal can be seen on template_build_error terminal.txt.

Update linking format and functionalities

Continuation of to-dos discussed in #9.

Reverse columns in link_map.log
Symlinks in link_map.log should be printed relative to the module root
Targets in input.txt should be communicated relative to the repository root
Targets in external.txt should be communicated using keys for config_local.yaml

Update documentation

Goals:

Update setup instructions in README.md
Add further comments to example scripts
Add general user guides

Discuss image file format

We were considering changing the default image format from ".pdf" to another file format. The main reason is that, in spite of being a format that is simple to work with, pdf files are not diffable in GitHub. @gentzkow mentioned that there are two main dimensions we care about: (i) diffability and (ii) vector rather than bitmap (which means plots always render at full resolution). We first considered shifting to .png, but that is a raster image file, so it does not fulfill (ii). It seems that .svg is a good candidate that fits all criteria. @meyer-carl further added that exporting SVG files should not be a problem for any of the programs that we use and using them on LyX/LaTeX should just require a proper set up.

@jmshapir @snairdesai @rcalvo12 @ew487 we would appreciate your thoughts on this.

Edit ReadME for accurate installation instructions

Hello @gentzkow,
The step (2) of the README to install conda and oracle_jdk says :

brew cask install miniconda
brew cask install oracle-jdk

This turns out to be an erroneous syntax, and should be :

brew install --cask miniconda
brew install --cask oracle-jdk

If you're ok, i am going to amend the README for correct instructions.

EDIT: I am not able to assign myself, and probably don't have admin on this repo to make ammendments on branch/merge.

Julia extension

In this issue we will test the run_julia wrapper from gslab_make (see gslab-econ/gslab_make#54).

Task list

Test if Julia + packages can be added to the conda environment.
Run Julia scripts from root using the run_julia wrapper

Implement coordinating mechanism for package versions

The goal of this issue is to incorporate into the current template a way to standardize the versions of packages used in a given project. Currently, necessary packages can be downloaded via the programs in setup/. It may still be the case that individuals end up with different packages on their local computers, which can lead to errors or differences in output. As of now, there is no mechanism by which we ensure that all necessary programs have in fact been downloaded, nor do we ensure that the users have the expected version of a package. We seek to resolve that potential pitfall in this issue.

tablefill whitespace

@zkashner,

In one of Matt's projects, I ended up having to change https://github.com/gentzkow/template/blob/3369531210cf54820225c0bbe7c05d0aec631e80/lib/gslab_make/tablefill.py#L51

    data = [null if value.strip() in null_strings else value for value in data]

because my entries were getting saved as " . " for some reason with extra whitespace. Stripping whitespace seems benign for this check of null values in tables (e.g., " NA " or " . " or " ") .

Flagging in case you agree and think its worthwhile to implement.

Update CRAN Mirror

Similar to what was done in gslab-econ/template, I need to update the CRAN mirror to be one that is operational.

Docker pilot run

In this issue, we will demonstrate replication of a simple project within a Docker container.

Edits to Wiki and README

This issue is a follow-up to #40. Its purpose is to implement edits to the repo Wiki and README outlined in this comment.

Review draft template and RA manual

Steps

Read through the current draft of the RA manual on the wiki.
Look over the current sketch of the template on the master branch of this repo (here). Don't worry much about the details -- it's incomplete and many things are just placeholders, but it should give you some sense of where we're headed.

Add license

It occurs to me that we should add a standard open source license to the readme for the template now that it is public.

Can you review the options for that and make a recommendation? We want to allow others to use and modify freely but require them to acknowledge us as the source.

Proposed revisions to standard template process

This issue follows from #59, and our conversations with Hunt Allcott's RAs regarding their issues using the current template procedure across different OS. The proposed revisions in this issue will be distinct from the Docker development workflow proposals in #56, which we hope to add as a separate template workflow for capable users.

Below are a list of issues raised by the Hunt RAs:

Arjun (MacOS)

The chips.csv and tv.csv files in template were not properly copied down to the local repository.
The run_all.py script did not fetch the correct directory due to an issue with the run_module() function in gslab_make.
The instructions on command line usage for Lyx did not work.
There was an error in line 56 of make.py in the paper_slides module, where the Lyx document did not compile and was exported to PDF format.
There was another error when attempting to compile the slides.lyx file with run_all.py.

Zimei (WindowsOS)

The chips.csv and tv.csv files in template were not properly copied down to the local repository.
There were issues with Python version conflicts related to creating the conda environment.
The directory in which check_setup.py was ran had to be changed in gslab_make.
To append a new table in paper_slides, Zimei modified the tablefill() function to feature multiple tables.
There were similar errors with compiling the .lyx file extensions.
Zimei skipped the template instruction referencing: conda init (echo $0 | cut -d'-' -f 2).

In addition, a former RA has noted issues with running template in Mac M1 computers due to issues with Python versions in conda.

The deliverable for this issue will be a revised README for template which provides updated instructions for each of these conflicts, as well as some additional clarifications for users.

Add to template wiki

Miscellaneous items to add:

Expand explanation of external dependencies
Add BibTeX usage

Hunt: 
i. PI who starts writing just writes inline cites as normal, e.g. "Gentzkow and Shapiro (2011)."
ii. RA comes after and converts everything to lyx citation fields and add the .bib file. If we already have a bibtex file with most of the references in it, this is easy, and the PI can even work from this in the initial draft in (i). 
Then there may be iteration if new cites are added, just as in the non-bibtex approach.

I should have said that whether this saves time depends on how good the existing .bib file is. For PhoneAddiction, I think we would start with the (counterfactual) one from SocialMediaEffects and would be in good shape. But if you've got a good .bib file then it's all there.

Matt:
On bibtex: OK got it. Sounds like the workflow is the same as what we've typically done, with the only exception being "RA converts everything to lyx citation fields and adds the .bib file" becomes "RA adds references to the back of the paper." While it's true that that loses the time saving from starting from a .bib file, we found in practice that you get almost the same time savings from starting with the reference list of the previous paper (say SocialMediaEffects) and copying / pasting references as needed. And on the flip side you don't have to pay the cost of converting references to lyx citation fields. My guess is that nets out to be pretty similar overall, or maybe even a win for the non-bibtex version.

 What pushed us away from bibtex was not only the issue of formatting the bibliography for journals, but also some nasty bugs where errors in the .bib file led to references being dropped from the reference list unexpectedly. Perhaps newer versions are more robust in that sense.

Add LyX commenting usage

Within the lyx files: (i) I really like the system of using yellow lyx comments for comments between coauthors to be resolved and deleted, and red latex comments for fact-checking, which never get deleted.

Describe the following scalar export method

1. In the stata code, create a tex file of new commands, with each command corresponding to a scalar you want in the paper.

Program to define a new latex command:
program define latex_nc
local value = `1'
local command "\\newcommand{\\\`1'}{`value'}"
! echo `command' >> "directory/scalars.tex"
end

Remove file in case it already exists:
rm "directory/scalars.tex"

Define and export a scalar:
sum age
scalar meanage = r(mean)
latex_nc meanage


2. The scalars.tex file will look like this:

\newcommand{\meanage}{25} 


3. In the paper: 

    Load the file using \input{directory/scalars.tex}
    Use a command from scalars.tex: $\meanage$

Let me know if you have any questions!

Port over to Hunt's repo when finished.

Update template with real example

Revise the template with a live example that provides a good illustration of how our code works in practice.

Example should ideally:

Illustrate all the key steps we'd have in a typical project
Provide best practice templates for code files, paper / slides, etc.
Be as simple as possible subject to achieving (1) and (2)
Be readable & fun

I would suggest that we base the example on the "tv & potato chips" example in code & data. The /data/ and /analysis/ subdirectories can follow roughly the structure shown on p. 16 of the guide. This will provide a nice link between the two. Don't worry about following what's in code & data religiously. After we lock in the template we can do a round of revision on code & data, updating it to match what we develop here.

I would also suggest that we make the base template use only Python and Latex (no Stata / R / Lyx). We should write pretty Python code using the standard data science tools (numpy, pandas, matplotlib, etc.). This is a good time to invest in establishing a Python code template that we like and will want to follow; you might browse some of the big tutorial sites like software carpentry, etc. and choose a good model to follow.

We will then open a new task to create a top level directory called /extensions/ which contains subdirectories called /stata/, /r/, /lyx/, etc. Each of these will be an example module showing best practice for the given software tool and also including any additional setup files, documentation, etc. that we need for that tool.

Minor note: Let's skip the step in code & data of having the raw data in Excel format. Let's put the raw data directly in csv format instead.

Add resources for learning R and Python to Wiki

@gentzkow I propose we add Datacamp (R and Python) and R4DS (R) to the Setup page. I've personally found both of them very useful, as have many of the newer RAs in the other labs in which I've worked.

Review current template

Clone & run the template
Look over documentation
Let me know anything that breaks or is unclear
Set up a time to discuss how we could make it simpler, more intuitive, or more robust

Update to Python 3.10

The goal of this issue is to update the Python version used in master. The steps are the following:

Build the conda environment with Python 3.10. Per https://github.com/gentzkow/CommitFlex/issues/114#issuecomment-1270855088, the Python version will be commented out in the setup/conda_env.yaml file, so that the norm is that the user updates to the latest version. Update the readme file to reflect current setup.
Run the repository with the updated conda environment. Report any package conflicts that arise (if any).

cc. @gentzkow @snairdesai

Issue with "Use this template" and Git LFS integration

The purpose of this issue (#78) is to address issues with the integrations of git-lfs raw files and the ability to use ~\gentzkow\template directly in template format. I've run into issues using this repository as a template for other independent projects, because the git-lfs files hosted on the repo are not properly tracked to any new repo initialized with this ~/template skeleton. Ngoc ran into the same issue when she was onboarding, as did BW when using ~/template to initialize another project for the team.

We've determined that forking ~/template to a new project allows for the transfer of git-lfs files, but ideally we would correct this integration element to enable the "Use as template" procedure. GitHub suggests this might not be easily solvable (see screenshot below), but it is worth some investigation to see if there are workarounds.

Screenshot from GitHub Docs

@jc-cisneros and I are assigned here.

Simplify make.py

I thought this might be a good time to take a look at the structure of the make.py scripts and see if there's any way we can streamline them.

Here are some comments / questions

The "Load GSLab Make" section is awkward. I don't remember what caused us to have such a complicated loading step here. It would be great if we could just say import gslab_make at the top.
"Check if running from root to check conda status" is a little obscure. Can we clarify what we're doing here?
I don't like "Uncomment for Stata scripts" on line 49. It looks like this is anticipating a situation where we might want to add a temp directory at the top level of the module for outputs we don't want to store/commit (something we do often for Stata scripts). But if we're not including that directory in the template better to just delete this. I wonder if it wouldn't be better practice actually to make that directory output/temp so the make.py script doesn't have to be modified
Line 52 should say something like "MAKE LINKS TO INPUT AND EXTERNAL FILES"

Any other improvements you can suggest?

Minor change in repository + full run check

In this issue we will move the Julia environment setup script (julia_conda_env.jl) to the setup module. As part of this issue, we will complete a full run of the repository to make sure everything is still working as expected.

Goals of the issue:

Move julia_conda_env.jl to setup
Build both the conda and the Julia environment.
Test full run of the repository (including the extra Julia script)

cc. @gentzkow @snairdesai

Discuss improvements to the template

The goal of this issue is to discuss improvements that can be made to the template going forward. Such improvements would in particular target smoother adaptations of repositories to specific needs of replication packages.

Indeed, in creating the replication package for Phone Addiction, we face several issues and bugs as we were trying to adapt the environment to be run within Docker.

Here are some take-aways from that process:

(1) We should make sure we're keeping Python/R dependencies to a minimum.

(2) We should disable functionality in the scripts that is interacting w/ Git (e.g., in defining ROOT and reporting repo characteristics for logging).

(3) We should test in the cloud environment or somewhere similar to make sure we're not relying on any idiosyncrasies of our own machines' setups.

(4) We should continue to explore using Docker.

Here are some further comments by Lars :

I think the fundamental issue that you have (as far as I can tell) already partially addressed is the idea of "exporting" your internal setup. While for instance using Git to check for ROOT is probably a MUST internally (to prevent accidentally not using Git), de-activating it, or having a fallback, might be useful for post-export.
There are some subtleties about Docker that remain bothersome. While I managed to fix it, not sure that's the right approach (why would I need to use two separate YAML files with the exact same content as your single one? That points to some problem with Conda's solver, but it seems screwy.)

I would still encourage you to explore the following:

Use the cloud resources, when possible, to assess robustness

for instance, using Github workflows

Figuring out how those can harness confidential data as well

Github workflows and codespaces allow for private keys - see my example with a Stata license - which can also be used to access Dropbox or Box APIs where the confidential data might be stored

those can be, in your environment, easily ported to a person-specific setup (the $HOME/.gslab/config file... 😉 ) but also by having simple branches (if Dropbox is local, use direct access, if not, use API)

I think the current Docker works, and can be reliably re-created. You might want to investigate how it differs from your project specific setup (I avoided the issue on how to activate an environment within Docker, since that would be two isolating environments when 1 is sufficient). I recreated it (and pushed it to Docker Hub) this morning, with the "one last run through" approach.

Discuss file versioning best practices

The goal of this issue is to decide on suggested best practices for implementing file versioning when using this template.

Updates to R dependency installations in Conda

@gentzkow @szahedian

Following from #53 and #54, @jc-cisneros and I were playing around with the gentzkow/template format to figure out how to install both R and Python dependencies in the same conda environment in a timely fashion. We had previously removed R dependencies altogether following comments (here and here) from the AEA Data Editor that the conda environment was not solving when /setup/conda_env.yaml had both R and Python dependencies listed. The fix below should resolve this:

Prior to running conda env create -f setup/conda_env.yaml, the user should run conda config --set channel_priority strict. The user can then proceed with conda activate <project_name>, and the rest of the template instructions. More on what this is doing under the hood here.
- Note: While this is not required, it is considered best practice to prefix any installations of R software (not packages, but the platform itself), with r-base=<version_name>. This change can easily be made in /setup/conda_env.yaml.

The main benefits of this approach are as follows:

Assuming all standard package dependencies are installable via conda-forge, we no longer need to run a separate script to pull dependencies from /extensions/R. We will still need to do so for STATA.
We do not need to create distinct conda environments for R and Python packages - both can be hosted in the same conda_env.yaml file.
Individuals running conda on MacOS for replication should no longer have major issues solving their environments if both Python and R dependencies are listed.
The above fix also solves part of #56 on Docker usage (in particular this comment). If we integrate this fix within our Docker container builds, individuals across OS systems should no longer have major issues solving their environments if both Python and R dependencies are listed. @jc-cisneros is working on implementing this within Docker currently, and will shortly post an update to #56.

As an example of computational efficiency, I ran this using all of the package dependencies from /setup/conda_env.yaml in Ad Price Drivers, and the process had completed within 5 minutes. Without this fix, we could not solve the environment for /gentzkow/template (with far fewer R dependencies) within 40 minutes.

If this approach is acceptable, @jc-cisneros and I can make revisions to the template instructions in the same file as in gslab-econ/ra_manual #18 and issue a PR here.

Set up base structure for template

Slides for Stanford Talk

A talk at the Stanford brown bag is scheduled for early December. Prepare slides for this.

Review dependencies

Take a quick look at our conda_env.yaml and flag any dependencies you think we could omit (either in the sense that they're not being used at all or in the sense that we could get rid of them and tweak the other code without much loss.

Review current template

Start by reading through the draft RA manual on the wiki here.

Then look over the template code structure and make sure you understand how the pieces fit together.

Review current implementation

Review our current implementation of the template + gslab_make, and discuss next steps.

The goal is to figure out how to make the template as simple, intuitive, and robust / bug proof as possible.

Build fully featured Docker container

Following #54 (comment), in this issue we will

Compile a list of all non-conda dependencies that could be required for current or future projects. Off the top of my head, this would include Stata and LyX.

Write a Dockerfile that builds an environment supporting conda and non-conda dependencies. We have experience with this in #43.

The main improvement over #43 is the addition of Stata, using ideas from AEA Stata for Docker.

Include gslab_make as a Git submodule

In this issue, we will move to including lib/gslab_make as a Git submodule. We currently do not track which version of gslab_make is used in this template; including gslab_make as a Git submodule will allow us to consistently track the version of gslab_make which is being used, and easily pull gslab_make changes into projects.

This issues is based on this comment on incorporating fixes for gslab_make into other repositories.

Add conda activate step before check_setup.py in README

Minor: Current instructions in the README direct the user to run check_setup.py before running conda activate. I would suggest we add the conda activate command to step (3) under Setup.

If you agree, please implement.

Review potential improvements from SocialMediaEffects data review

Review potential improvements
Add to RA manual: No hard-coding values in code
Add to RA manual: Export figures with specifications
Add to RA manual: Delinking from git/Github procedure
Investigate docker

Test closing issue w/ pull request

Evaluate workflow where:

We change pull request naming convention to omit "Pull request for #XX"
Final comment on issue is made in pull request
Issue is closed from within pull request

Adding Julia extension instructions to template

The purpose of this issue (#76) is to provide additional user documentation in README.md on our Julia extension, which was integrated within gentzkow/template in #67. I noticed when pulling these changes to our selective-exposure project that the README.md file does not currently have instructions on running Julia scripts or initializing the Julia environment within miniconda. This process differs slightly from those we use for other softwares, because Julia is not yet fully integrated within conda-forge. I will shortly post a proposed revision to the README.md file describing this procedure.

cc @jc-cisneros @gentzkow

Investigate error in `check_setup.py` when freezing dependencies

@zkashner flagged that when he freezes a dependency version in the conda_env.yaml file, the current check_setup.py script fails to parse the names of the packages correctly. The goal of this issue is to fix that bug.

The steps to be taken are the following:

Reproduce the error described above and post it on this issue.
Modify check_setup.py to be robust to this scenario. @zkashner mentioned he can share a potential solution for this problem.

Thanks @zkashner for flagging this!

Review implementation of template for SocialMediaEffects

Companion of gslab_make #18

Debrief what what we've learned from implementing gslab_make and template for SocialMediaEffects
Make improvements accordingly

Testing Docker integration

Following #43 and #47, this issue will test the portability of docker build across multiple OS. Tentatively, this should include ARM Mac, x86 Mac, Windows, and Linux machines, per this comment.

Conda version and packages conflicts

Hello @gentzkow
In updating the ad-price-drivers to this template, I noticed that the user may run into several issues due to the multiplicity of packages needed in the conda environment.

Let me be more specific.
In the presence of many packages (such as R packages), the common dependencies needed for different packages may run into conflicts in the conda environment. There is ongoing documentation of this error (see here or here for instance).
It turns out that the only things that works in such a case is to downgrade conda, as has been suggested in some comments.

I was wondering whether we would want to add this to the ReadMe? F
or instance, we could suggest that if a user intends on having several packages in their environment, then it is advised to downgrade Conda by running conda install conda=4.6.14.

This is just a suggestion, and of course looking forward to hearing your opinion on this.

Tagging @szahedian as we discussed this earlier today!

Fix dependency issue between rlang and tidyverse in template

The purpose of this issue is to fix a potential dependency issue in our conda environment when building template with R.

Origin + Description of Bug

While compiling another lab repository with R scripts, I found a bug in template arising from the conda environment build (@jc-cisneros also confirmed this on his end). See the error message below from my local console session:

Error message:

This error is not thrown for other packages loaded with conda-forge (i.e., lubridate or data.table), and the repository in question was successfully compiling less than a month ago.

Theory + Next Steps

@jc-cisneros and I will investigate this further. It seems that both ggplot2() and tidyverse() are up to date with CRAN in our conda environment. Our sense is that this does not have to do with the conda build itself, but might have to do with an update to rlang, which relies on base R.

When we freeze R at version 4.0 in the conda environment, rlang is version 1.0.6.
When we do not freeze R at a particular version, we default to the latest (as of now, R == 4.2; rlang == 1.1.1).

The error is likely being thrown by a dependency from rlang which tidyverse relies upon, which is no longer being supported in 4.2. A patch fix is to freeze R at 4.0, but we want to ensure our setup works for the latest versions.

cc @gentzkow

Update documentation and release v0.1

Continuation of #10 (see here).

Update README for template and release as Beta v0.1.

Log software + packages versions

In this issue we will test the write_version_log feature from gslab_make (see gslab-econ/gslab_make#56).

Task list

Add the new feature to the make scripts in each submodule.
Check that a versions.log file is created in each submodule's log folder.
Check that versions.log correctly prints the versions from the active environment.
Robustness tests (i.e., the script should produce a result for all possible user behavior and throw appropriate error messages when relevant)

Stata integration for Docker

In #45, we did proof of concept of replication in Docker. In this issue, we will extend the basic Docker build to support Stata.

Plan external data versioning for completed projects

The goal of this issue is to investigate options and decide on suggested best practices for archiving the specific version of external files used at key points in a project. Some possibilities for implementation include:

creating a .zip file storing the particular version of files used for say the accepted version of a paper, to be stored on Dropbox/Oak/Github
any file versioning/release features that may be available within Dropbox, if they exist

Successful completion of this issue would include suggested best practices that would ensure replicability of project's analysis at a key point in time, while minimizing data storage costs to the extent possible.

Review gslab_make and gslab_fill Python libraries

I am working on completing the first draft of a revised template and RA manual that, as I mentioned, I hope to work on together.

One of the tasks will be for you to revise some of our older Python libraries to provide the tools we need for the template in the simplest, clearest, and most robust form possible. The basis for these will be the gslab_make and gslab_fill modules in the gslab_python repository. (No need to dive in to the gslab_scons module).

In this issue, I'd like you to take some time to familiarize yourself with these libraries so you will be ready to work on them efficiently.

In case you're curious, the code section of this repository has my work in progress on the template. It's still partial and incomplete, so you shouldn't spend any significant time on it until I am done.

Review and test template

Hi all,

The goal of this issue is for all assignees to:

Read through the RA manual
Check out a clean copy of template
Following the RA manual and setup instructions in README.md, build the repository from start to finish
Test template and attempt to break it; report any bugs or errors
Provide feedback and identify room for improvement

Note that the current example scripts are pretty lacking and one of the items on our todos is to flesh them out to be more comprehensive. Let me know if you have any ideas here!