
template's Introduction

README (Edited by Shiqi)


Requirements

Note: The application requirements and setup instructions outlined below are intended to serve general users. To build the repository as-is, the following applications are required:

You may download the latest versions of each. By default, the Setup instructions below will assume their usage. Note that some of these applications must also be invocable from the command line. See the Command Line Usage section for details on how to set this up. Note that if you wish to run Julia scripts in your repository, you will additionally need to install Julia and set up its command line usage. Julia is currently not required to build the repository as-is. If you are planning to use a conda environment for development (see instructions below), you are not required to have local installations or enable command line usage of Stata, R, Python, or Julia (although this is recommended).
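
One quick way to check whether these applications are already invocable from your command line is shown below. This is only a convenience check; the executable names used are the defaults listed under Command Line Usage and may differ on your machine.

    # Prints the path of each executable it finds; anything missing is not
    # yet set up for command line usage.
    command -v python Rscript stata-mp lyx git-lfs julia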

You must set up a personal GitHub account to clone private repositories on which you are a collaborator. For public repositories (such as template), Git will suffice. You may need to set up Homebrew if git and git-lfs are not available on your local computer.
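
If you do need to install git and git-lfs, the standard Homebrew formulae can be used, for example:

    brew install git git-lfs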

If you are using MacOS, ensure your terminal is running bash rather than the default zsh. MacOS users running template on an Apple Silicon chip will additionally want to run their terminal through Rosetta. You can find instructions on how to switch from zsh to bash and on enabling Rosetta here and here.

Windows users (version 10 or higher) will need to switch from PowerShell to bash. To do this, you can run bash from within a PowerShell terminal (you must have installed Git first).

Once you have met these OS and application requirements, clone a team repository from GitHub and proceed to Setup.


Setup

  1. Create a config_user.yaml file in the root directory. An example can be found in the /setup directory. If this step is skipped, the default config_user.yaml will be copied over when running check_setup.py below. You might skip this step if you do not want to specify any external paths or if you want to use the default executable names. See the User Configuration section below for further details.

  2. Initialize git lfs. From the root of the repository, run:

   git lfs install
   ./setup/lfs_setup.sh
   git lfs pull

This will not affect files that ship with the template (which use the standard git storage). The first command will initialize git lfs for usage. The second command will instruct git lfs to handle files with extensions such as .pdf, .png, etc. The third command will download large files from the remote repository to your local computer, if any exist. See here for more on how to modify your git lfs settings.
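
If you want to inspect or extend the set of tracked extensions yourself, git lfs exposes this directly. The commands below are illustrative; /setup/lfs_setup.sh already covers the template's defaults.

    # List the patterns git lfs currently tracks (stored in .gitattributes).
    git lfs track
    # Track an additional extension, for example large CSV files.
    git lfs track "*.csv"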

Note that initializing git lfs is not strictly required to work with the files hosted on template, but it is highly recommended for large file storage; running the commands above sets this up.

  3. If you already have conda set up on your local machine, feel free to skip this step. If not, this step will install a lightweight version of conda that will not interfere with your local Python and R installations.

NOTE: If you do not wish to install conda, proceed to steps 6 - 8 (installing conda is recommended).

Install miniconda to manage the R/Python virtual environment, if you have not already done so. If you have Homebrew (which can be downloaded here), miniconda can be installed as follows:

    brew install --cask miniconda

Once you have installed conda, you need to initialize conda by running the following commands and restarting your terminal:

    conda config --set auto_activate_base false
    conda init $(echo $0 | cut -d'-' -f 2)
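
The second command above infers your current shell. If the substitution does not resolve correctly in your terminal, you can initialize conda for your shell explicitly instead (assuming bash, per the Requirements above):

    conda init bash
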
  4. Next, create a conda environment with the commands:
    conda config --set channel_priority strict
    conda env create -f setup/conda_env.yaml

The default name for the conda environment is template. This can be changed by editing the first line of /setup/conda_env.yaml. To activate the conda virtual environment, run:

    conda activate <project_name>

The conda environment should be active throughout setup, and whenever executing modules within the project in the future. You can deactivate the environment with:

    conda deactivate

Please ensure that your conda installation is up to date before proceeding. If you experience issues building your conda environment, check the version of your conda installation and update it if needed by running:

conda -V
conda update -n base -c defaults conda

Then, proceed to rebuild the environment.

  5. Fetch gslab_make submodule files. We use a Git submodule to track our gslab_make dependency in the /lib/gslab_make folder. After cloning the repository, you will need to initialize and fetch files for the gslab_make submodule. One way to do this is to run the following bash commands from the root of the repository:
   git submodule init
   git submodule update

Once these commands have run to completion, the /lib/gslab_make folder should be populated with gslab_make. For users with miniconda, proceed to step 7.
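
Alternatively, if you have not yet cloned the repository, you can fetch the submodule at clone time instead:

    git clone --recurse-submodules <repository-url>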

  6. For users who do not want to install miniconda, follow the instructions in /setup/dependencies.md to manually download all required dependencies. Ensure you download the correct versions of these packages. Proceed to step 7.

  7. Run the script /setup/check_setup.py. One way to do this is to run the following bash command from the /setup directory (note that you must be in the /setup directory and have local installations of the software documented in Requirements for the script to run successfully):

   python check_setup.py
  8. To build the repository, run the following bash command from the root of the repository:

    python run_all.py
    

Adding Packages

Note: These instructions are relevant for users who have installed miniconda. If you have not done so, consult /setup/dependencies.md.

Python

Add any required packages to /setup/conda_env.yaml, pinning the package version number where possible. If a package is not available from conda, add it to the pip section of the yaml file. To avoid rebuilding the entire environment, you can install individual packages into the existing environment with the command:

conda install -c conda-forge --name <environment name> <package_name=version_number>
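
For example, to add a pinned package to an environment named template (the default environment name; the package and version below are purely illustrative):

    conda install -c conda-forge --name template pandas=2.1.0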

R

Add any required packages that are available via CRAN to /setup/conda_env.yaml. These must be prepended with r-. If there is a package that is only available from GitHub and not from CRAN, add this package to /setup/setup_r.r (after copying this script from /extensions). These individual packages can be added in the same way as Python packages above (with the r- prepend). Note that you may need to install the latest version of conda as outlined in the setup instructions above to properly load packages.
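
For example, to add a pinned CRAN package to the default environment (again, the package and version are purely illustrative):

    conda install -c conda-forge --name template r-stringr=1.5.0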

Stata

Install Stata dependencies using /setup/download_stata_ado.do (copy download_stata_ado.do from /extensions to /setup first). We keep all non-base Stata ado files in the lib subdirectory, so most non-base Stata ado files will be versioned. To add additional Stata dependencies, use the following bash command from the setup subdirectory:

stata-mp -e download_stata_ado.do

Julia

First, add any required Julia packages to julia_conda_env.jl. Follow the same steps described in Setup to build and activate your conda environment, being sure to uncomment the line referencing julia in /setup/conda_env.yaml before building the environment. Once the environment is activated, run the following line from the /setup directory:

julia julia_conda_env.jl

Then, ensure any Julia scripts are properly referenced in the relevant make.py scripts with the prefix gs.run_julia, and proceed to run run_all.py.


Command Line Usage

For instructions on how to set up command line usage, refer to the repo wiki.

By default, the repository assumes these executable names for the following applications:

application : executable

python      : python
git-lfs     : git-lfs
lyx         : lyx
r           : Rscript
stata       : stata-mp (this will need to be updated if using a version of Stata that is not Stata-MP)
julia       : julia

Default executable names can be updated in config_user.yaml. For further details, see the User Configuration section.


User Configuration

config_user.yaml contains settings and metadata such as local paths that are specific to an individual user and should not be committed to Git. For this repository, this includes local paths to external dependencies as well as executable names for locally installed software.

Required applications may be set up for command line usage on your computer with a different executable name from the default. If so, specify the correct executable name in config_user.yaml. This configuration step is explained further in the repo wiki.


Running Package Scripts in Other Languages

By default, this template is set up to run Python scripts. The template is, however, capable of running scripts in other languages too (make-scripts are always in Python, but module scripts called by make-scripts can be in other languages).

The directory /extensions includes the code necessary to run the repo with R and Stata scripts. Only code that differs from the default implementation is included. For example, to run the repo using Stata scripts, the following steps need to be taken (a bash sketch of these copy commands follows the list).

  1. Replace /analysis/make.py with /extensions/stata/analysis/make.py and /data/make.py with /extensions/stata/data/make.py.
  2. Copy contents of /extensions/stata/analysis/code to /analysis/code and contents of /extensions/stata/data/code to /data/code.
  3. Copy .ado dependencies from /extensions/stata/lib/stata to /lib/stata. Included are utilities from the repo gslab_stata.
  4. Copy setup script from /extensions/stata/setup to /setup.
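
The copy steps above can be carried out with commands along these lines, run from the repository root. This is only a sketch; confirm the exact file and folder names in /extensions before running.

    cp extensions/stata/analysis/make.py analysis/make.py
    cp extensions/stata/data/make.py data/make.py
    cp -r extensions/stata/analysis/code/. analysis/code/
    cp -r extensions/stata/data/code/. data/code/
    cp -r extensions/stata/lib/stata/. lib/stata/
    cp extensions/stata/setup/* setup/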

Windows Differences

The instructions in template are applicable to Linux and Mac users. However, with just a few tweaks, this repo can also work on Windows.

If you are using Windows, you may need to run certain bash commands in administrator mode due to permission errors. To do so, open your terminal by right clicking and selecting Run as administrator. To set administrator mode on permanently, refer to the repo wiki.

The executable names are likely to differ on your computer if you are using Windows. Executable names for Windows generally resemble:

application : executable
python      : python
git-lfs     : git-lfs
lyx         : LyX#.# (where #.# refers to the version number)
r           : Rscript
stata       : StataMP-64 (will need to be updated if using a version of Stata that is not Stata-MP or 64-bit)
julia       : julia

To download additional ado files on Windows, you will likely have to adjust this bash command:

stata_executable -e download_stata_ado.do

stata_executable refers to the name of your Stata executable. For example, if your Stata executable was located in C:\Program Files\Stata15\StataMP-64.exe, you would want to use the following bash command:

StataMP-64 -e download_stata_ado.do
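
If your Stata executable is not yet on the PATH of your bash terminal, one option is to add its folder for the current session. The path below is an illustration assuming the Stata 15 install location from the example above; Git Bash maps C:\ to /c/.

    export PATH="$PATH:/c/Program Files/Stata15"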

License

MIT License

Copyright (c) 2019 Matthew Gentzkow

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


template's Issues

Tests new PDF export from CSV

@ShiqiYang2022: Let me know if you see any issues here, or come across any problems. Thanks for testing! Instructions are below:

  1. Clone the template repository, and switch to the branch 85_ditab_format.
  2. Create the config_user.yaml file and initialize git lfs as usual.
  3. Build and activate the conda environment. I had to add a couple of Python packages for the automated export to PDF to compile, so you may need to delete and rebuild the environment if you have an older version stored locally:
    conda env remove -n template
    conda config --set channel_priority strict
    conda env create -f setup/conda_env.yaml
  4. Initialize and update the submodule as usual. Then, navigate to ~/lib/gslab_make and switch to the branch template85_ditab_format.

  5. Execute python run_all.py from root. This will populate the placeholder scalars when running the analysis module, and produce the formatted PDF tables when running paper_slides.

  6. You should see the populated PDFs with the proper formatting now live in ~/paper_slides/output. They are the files prefixed with gs_ (see here).


A few notes:

  • The first time you run this, Excel will open on your computer and may prompt you to enable file access for the folder where you have cloned the repository. You should only need to do this once (the first time you execute run_all.py).
  • After you execute run_all.py for the first time to populate the outputs from the ~/analysis module, you can make desired formatting edits to the Excel sheets in the ~/paper_slides/skeleton folder, and (assuming the link references are also properly updated), the formatting will update if you run only ~/paper_slides.

BLP Draft

Update: tag jms nano

I conducted several rounds of testing (thanks snd, jc) on Sherlock to determine the resources available, and I implemented job runs with different ordering structures and parallel execution strategies. I am therefore posting my notes and proposing my job submission restructuring solution.

Resources Limit

  1. Job submission limit: a maximum of $2,000$ jobs per user in the gentzkow group account.
  2. Computing power allowed: from testing, the maximum number of CPUs the gentzkow group can use for computing in total is $2,000$.
  3. Computing power available: in brief, the nodes (when available) can support computing on $2,000$ CPUs (i.e., full capacity) at the same time.
  • Our group can currently access the Sherlock gentzkow and hns (Humanities and Sciences) nodes. We can also access the normal (public) nodes, but the average job queue time on those nodes is $13$ days. That is because we submitted a huge number of jobs over the previous several months, resulting in an incredibly low sshare score (a metric in SLURM's fair-share scheduling) that effectively prevents us from using the normal nodes.
  • The gentzkow nodes have $148$ CPUs in total, which are available to group users at any time; the hns nodes have $3,968$ CPUs in total, did not seem that busy when I tested last week, and can allocate enough CPUs to our group to reach the $2,000$-CPU computing limit.
  4. Memory limit: most nodes exhibit a ratio of 8 GB of memory per CPU, with configurations of 20 CPUs / 128 GB RAM, 24 CPUs / 192 GB RAM, 32 CPUs / 256 GB RAM, and 128 CPUs / 1024 GB RAM.

Resources Needed

In the last full run before the NBER submission, we output a total of $24,087$ RCNL, $35,698$ RCL, and $29,581$ L estimations. Per my investigation, one RCNL estimation takes on average $5$ hours on a single CPU, versus $0.6$ hours for RCL and $0.2$ hours for L. The total CPU-hours for the full run are therefore roughly $24,087 \times 5 + 35,698 \times 0.6 + 29,581 \times 0.2 = 120,435 + 21,418.8 + 5,916.2 = 147,770$ CPU-hours.

If we run jobs at full capacity, each CPU would ideally be busy for approximately $\frac{147,770}{2,000} \approx 73.885$ hours.

Proposed Solution

My proposed solution is to combine 4 jobs from the previous submit_jobs.py into 1 new job. Within each new job, we run the 4 previous jobs in parallel using the parpool() function, assigning 1 CPU to each (so 4 CPUs per new job; a hypothetical job sketch appears after the list below). The strengths of this new approach are:

  1. Substantially cut the number of jobs submitted, so that one lab member can submit the full run in a single round of submissions.
  • We previously had $5$ estimations within one RCNL job, $25$ for RCL, and $50$ for L.
  • In this proposal, we will have $4 \times 5 = 20$ estimations within one RCNL job, $100$ for RCL, and $200$ for L.
  • The total number of jobs (RCNL + RCL + L) is $\frac{24,087}{20} + \frac{35,698}{100} + \frac{29,581}{200} = 1,204.35 + 356.98 + 147.905 \approx 1,710 < 2,000$.
  2. Make full, continuous use of CPU capacity.
  • In previous runs, we did not fully utilize the CPU capacity (using $\approx 1,200$ CPUs per round), and we did not run jobs continuously because we needed to coordinate with lab members and submit jobs in batches; after each batch was submitted, its jobs had to wait in the queue.

  • In this proposal, the user submits all $1,710$ jobs at once. Because each job requires 4 CPUs and the group's total computing limit is $2,000$ CPUs, a maximum of $500$ jobs can run on Sherlock at any given time; the remaining jobs will be queued. As soon as some jobs finish and CPU slots become available, queued jobs will automatically fill those slots.

  3. Reduce the time of the full run. I expect that under this proposal the full run can complete within 1 week. I am not sure whether 1 week is too optimistic, since I have not yet tested the proposal on the full run, but if at full capacity each CPU needs to work for $73.885$ hours, a 1-week estimate does not seem overly ambitious, provided we can run estimations continuously at full capacity.
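
For concreteness, a combined job under this proposal might be submitted with a SLURM script along the lines of the sketch below. All names and values here are hypothetical placeholders (the actual submission is handled by submit_jobs.py), but they illustrate the 4-CPUs-per-job structure:

    #!/bin/bash
    #SBATCH --job-name=rcnl_batch      # hypothetical job name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4          # 4 CPUs per combined job, as proposed
    #SBATCH --mem=32G                  # 8 GB per CPU x 4 CPUs
    #SBATCH --time=48:00:00            # placeholder wall time

    # Hypothetical entry point: inside Matlab, parpool(4) runs the 4
    # former jobs' estimations in parallel, one worker per CPU.
    matlab -batch "run_estimation_batch"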

I tested the parallelized job submission structure on RCNL estimations for simulated datasets $N = 1$–$100$ and confirmed that strengths 1 and 2 of this new approach hold for my test case.

The only issue worth flagging is that the .out files (job log files) might be scrambled because of the parallelized run, but I think we can always add a "light run" version of the estimation (say, 5 simulations) that writes its .out file in a non-parallelized way.

Next Steps

  1. If the new proposal sounds reasonable to JMS NB, I plan to test the full run this week.
  2. Determine the run time at a more granular level. Currently, run time varies even within the same model specification; for instance, some RCNL jobs took more than 9 hours to output one estimation. This is due to the use of different combinations and tolerance levels; by investigating and specifying those, we can improve job submission efficiency.
  3. Determine the memory requirement of an RCNL job. Previously we requested 20 GB per RCNL estimation, but on Sherlock most nodes exhibit a ratio of 8 GB of memory per CPU. @NB, do you think 20 GB is necessary for RCNL, or can we cut the requested memory a bit?
  4. (In the longer run) I feel that our restructuring efforts have not altered the total runtime of job execution; we have simply rearranged the job execution structure. In the long term, I believe we may still need to improve the speed of the residualizing code within Matlab, because the job submission structure in this new proposal is already close to the threshold of what one person can submit across all resource dimensions.
