
template's Issues

Tests new PDF export from CSV

@ShiqiYang2022: Let me know if you see any issues or run into any problems. Thanks for testing! Instructions are below:

  1. Clone the template repository and switch to the branch 85_ditab_format.
  2. Create the config_user.yaml file and initialize git lfs as usual.
  3. Build and activate the conda environment. I had to add a couple of Python packages for the automated export to PDF to compile, so you may need to delete and rebuild the environment if you have an older version stored locally:

```bash
conda env remove -n template
conda config --set channel_priority strict
conda env create -f setup/conda_env.yaml
```

  4. Initialize and update the submodule as usual. Then navigate to ~/lib/gslab_make and switch to the branch template85_ditab_format.

  5. Execute python run_all.py from root. This will populate the placeholder scalars when running the analysis module and produce the formatted PDF tables when running paper_slides.

  6. The populated PDFs with the proper formatting should now live in ~/paper_slides/output. They are the files prefixed with gs_ (see here).


A few notes:

  • The first time you run this, Excel will open on your computer and may prompt you to enable file access for the folder where you cloned the repository. You should only need to do this once (the first time you execute run_all.py).
  • After you execute run_all.py for the first time to populate the outputs from the ~/analysis module, you can make any desired formatting edits to the Excel sheets in the ~/paper_slides/skeleton folder; assuming the link references are also properly updated, the formatting will then update when you run only ~/paper_slides.

BLP Draft

Update: tag jms nano

I conducted several rounds of testing (thanks snd and jc) on Sherlock to determine the resources available, and I implemented job runs with different ordering structures and parallel execution strategies. Below are my notes and my proposed restructuring of the job submission.

Resources Limit

  1. Number of jobs allowed: a maximum of $2,000$ jobs per user in the gentzkow group account.
  2. Computing power allowed: from testing, the maximum number of CPUs the gentzkow group can use for computing in total is $2,000$.
  3. Computing power available: in brief, the nodes (if available) can support computing on $2,000$ CPUs (i.e., full capacity) at the same time.
  • Our group can currently access the Sherlock gentzkow and hns (Humanities and Sciences) nodes. We can also access the normal (public) nodes, but the average job queue time on those nodes is $13$ days. That is because we submitted a huge number of jobs over the previous several months, which drove our sshare score (a metric in SLURM's fair-share scheduling) incredibly low and effectively prevents us from using the normal nodes.
  • The gentzkow nodes have $148$ CPUs in total, which are available to group users at any time; the hns nodes have $3,968$ CPUs in total, did not seem that busy when I tested last week, and can allocate enough CPUs to our group to support $2,000$ CPUs of computing.
  4. Memory limit: most nodes have a ratio of 8 GB of memory per CPU, with configurations of 20 CPUs with 128 GB RAM, 24 CPUs with 192 GB RAM, 32 CPUs with 256 GB RAM, and 128 CPUs with 1024 GB RAM.
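As a quick sanity check on the 8 GB-per-CPU claim, the ratios for the quoted configurations can be computed directly (a minimal sketch using only the numbers above; note that the 20-CPU nodes are actually slightly below 8 GB/CPU):

```python
# Memory-per-CPU ratios for the node configurations listed above.
node_configs = [(20, 128), (24, 192), (32, 256), (128, 1024)]  # (CPUs, GB RAM)
ratios = {cpus: ram / cpus for cpus, ram in node_configs}
print(ratios)  # only the 20-CPU configuration falls below 8 GB/CPU
```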

Resources needed

In the last full run before the NBER submission, we output in total $24,087$ RCNL, $35,698$ RCL, and $29,581$ L estimations. Per my investigation, one RCNL estimation takes on average $5$ hours on a single CPU, versus $0.6$ hours for RCL and $0.2$ hours for L. The total cost of a full run is therefore roughly $24,087 \times 5 + 35,698 \times 0.6 + 29,581 \times 0.2 = 120,435 + 21,418.8 + 5,916.2 = 147,770$ CPU-hours.

If we ran jobs at full capacity, ideally each CPU would be busy for approximately $\frac{147,770}{2,000} \approx 73.885$ hours.
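This back-of-the-envelope arithmetic can be reproduced in a few lines of Python (a sketch; the per-estimation runtimes are the rough averages quoted above, not measured constants):

```python
# Estimate total CPU-hours for a full run and the ideal per-CPU load.
counts = {"RCNL": 24_087, "RCL": 35_698, "L": 29_581}   # estimations per model
hours_per_estimation = {"RCNL": 5.0, "RCL": 0.6, "L": 0.2}  # avg hours on 1 CPU

total_cpu_hours = sum(counts[m] * hours_per_estimation[m] for m in counts)
max_cpus = 2_000  # group-wide CPU cap on Sherlock
hours_per_cpu = total_cpu_hours / max_cpus

print(f"Total: {total_cpu_hours:,.1f} CPU-hours")
print(f"At full capacity each CPU is busy ~{hours_per_cpu:.1f} hours "
      f"(~{hours_per_cpu / 24:.1f} days)")
```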

Proposed Solution

My proposed solution is to combine 4 jobs from the previous submit_jobs.py into 1 new job. For each new job, we run the 4 previous jobs in parallel using the parpool() function and assign 1 CPU to each (so 4 CPUs per new job). The strengths of this new approach are:

  1. It substantially cuts the number of jobs submitted, so one lab member can submit the full run in a single round of submission.
  • We previously had $5$ estimations within one RCNL job, $25$ for RCL, and $50$ for L.
  • Under this proposal, we will have $4 \times 5 = 20$ estimations within one RCNL job, $100$ for RCL, and $200$ for L.
  • The total number of jobs (RCNL + RCL + L) is $\frac{24,087}{20} + \frac{35,698}{100} + \frac{29,581}{200} = 1,204.35 + 356.98 + 147.905 \approx 1,710 < 2,000$.
  2. It makes full, continuous use of CPU capacity.
  • In previous runs we did not fully utilize the CPU capacity (using $\approx 1,200$ CPUs per round), and we did not run jobs continuously, because we needed to coordinate with lab members and submit jobs in batches; after each batch was submitted, it had to wait in the queue.

  • Under this proposal, the user submits all $1,710$ jobs at once. Each job requests 4 CPUs while the group-wide cap is $2,000$ CPUs, so at most $500$ jobs can run on Sherlock at any given time and the remaining jobs will be queued. As soon as jobs finish and CPU slots free up, the queued jobs automatically fill them.

  3. It reduces the time of a full run. I expect the full run to finish within 1 week under this proposal. I am not sure whether 1 week is too optimistic, since I have not tested the proposal on the full run yet, but if each CPU needs to work for $73.885$ hours at full capacity, a 1-week estimate does not seem too ambitious, provided we can keep running estimations continuously at full gear.
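The job-count arithmetic above can be sketched in Python (the batch sizes and counts are those quoted in this note; the concurrency figure is a simplification that ignores scheduler overhead):

```python
import math

# Job counts under the proposed 4-in-1 job structure.
counts = {"RCNL": 24_087, "RCL": 35_698, "L": 29_581}  # estimations per model
old_batch = {"RCNL": 5, "RCL": 25, "L": 50}            # estimations per old job
new_batch = {m: 4 * b for m, b in old_batch.items()}   # 4 old jobs merged into 1

jobs = {m: math.ceil(counts[m] / new_batch[m]) for m in counts}
total_jobs = sum(jobs.values())

cpus_per_job = 4
group_cpu_cap = 2_000
concurrent_jobs = group_cpu_cap // cpus_per_job  # jobs running at any one time

print(f"Jobs per model: {jobs}")
print(f"Total jobs: {total_jobs} (per-user limit: 2,000)")
print(f"Concurrent jobs at full capacity: {concurrent_jobs}")
```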

I tested the parallelized job submission structure on RCNL estimations of $N = 1$ - $100$ simulated data, and I confirmed that strengths 1 and 2 of this new approach hold for my test case.

The only issue that might be worth flagging is that we could get scrambled .out files (the job log files) because of the parallelized run, but I think we can always add a "light run" version of the estimation (say, 5 simulations) that writes the .out file in a non-parallelized way.
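One possible mitigation for the scrambled logs is per-task log files rather than a shared stream. This is a hypothetical sketch in Python using threads (the real jobs run MATLAB estimations via parpool(), and `run_estimation` here is a stand-in, not our actual code):

```python
from concurrent.futures import ThreadPoolExecutor

# Each parallel task writes to its own small log file, so the shared
# .out file never sees interleaved output from concurrent workers.
def run_estimation(task_id: int) -> str:
    log_path = f"estimation_{task_id}.log"
    with open(log_path, "w") as log:
        log.write(f"[task {task_id}] starting\n")
        result = task_id ** 2  # stand-in for the actual estimation work
        log.write(f"[task {task_id}] result={result}\n")
    return log_path

with ThreadPoolExecutor(max_workers=4) as pool:  # 4 parallel slots per merged job
    logs = list(pool.map(run_estimation, range(4)))
print(logs)
```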

Next Steps

  1. If the new proposal sounds reasonable to JMS NB, I plan to test the full run this week.
  2. Determine the run time at a more granular level. Currently the running time varies even within the same model specification; for instance, some RCNL jobs took more than 9 hours to output one estimation. This is due to the use of different combinations and different tolerance levels; by investigating and specifying those, we can improve job submission efficiency.
  3. Determine the memory needed for RCNL jobs. We previously requested 20 GB per RCNL estimation, but on Sherlock most nodes have a ratio of 8 GB of memory per CPU. @ NB, do you think 20 GB is necessary for RCNL, or can we cut the requested memory a bit?
  4. (In the longer run) Our restructuring hasn't altered the total runtime of job execution; we've simply rearranged the job execution structure. In the long term, I believe we may still need to improve the speed of the residualizing code within Matlab, because the job submission structure in the new proposal is already close to the threshold of what one person can submit across all resource dimensions.
