Code Monkey home page Code Monkey logo

Comments (19)

z-y-huang avatar z-y-huang commented on September 2, 2024

Hi @gentzkow:

To follow up on this comment here, I've synced any changes to gslab_make with template and updated the RA manual in the wiki.

Let me know if there are any changes you would like me to make before opening an issue to have the other RAs review the template.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Thanks @z-y-huang. I've read over the wiki and made some minor edits.

Here are some additional to dos I'd suggest we knock off before sending for review.

  1. I realize we have not included /temp/ directories in the template. These can be pretty important in my experience as a place to store intermediate files that are only ever needed within the module. I added discussion of these to the repository structure wiki page. Why don't you also add /temp/ to the template's .gitignore. When we create the live working examples we should make sure and include one that uses the /temp/ directory.

  2. I would like to allow code to reference /raw/ directories directly. I removed the note in the manual about only referencing these via /input/. I'd appreciate it if you could update the template to reflect this (or let me know if you disagree with the choice).

  3. I think we should implement in the template the decision we made in SocialMediaEffects to copy and commit inputs for paper_slides to the repo rather than making them links. I'd appreciate it if you could implement this in the template and update the wiki to explain it.

  4. We need a more detailed discussion of the protocol for releasing output to an external directory (e.g., like we do with output_local). Let's make this a separate section of the repository structure wiki page and provide more detailed instructions for how exactly this works. Then when we discuss output_local we can just reference this section.

  5. We need to do some work on readme.md. This should focus on steps needed to setup the repository and get it running. It should refer via a link to the wiki for instructions on using the repository (e.g., how to run the build scripts). We should try to get rid of anything that is redundant with the wiki.

    I would also suggest that we include separate sections at the end called "Specific Instructions for Mac" and "Specific Instructions for Windows" that collect everything platform specific. The main section on command line usage can be very short and just give the list of commands that should be able to run the software by default, then say how to change the names for executables, then point to the mac / windows sections for more details on setup.

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

Got it.

It sounds like from the emails that we want to support Jupyter Notebooks, which I just added with a run function called run_jupyter. The run_jupyter function is currently executing notebooks via the command line, though we can also change it to be a wrapper for the Python API interface. The advantage of executing via command line is that it keeps the codebase simple as run_jupyter will mirror the other run functions. The advantage of executing via Python API interface is that it's one less thing users have to set up command line usage for. Do you have a preference either way?

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

I believe we can add Jupyter to the requirements.txt and preface the jupyter command with python -m. I'll need to double-check this won't potentially lead to conflicts with Anaconda (since that's how most people have Jupyter installed), but will go with that if there isn't.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

Hi @gentzkow:

I pushed a branch addressing the items here.

Four comments:

  1. My original reasoning behind imposing the rule that /code/ only ever reference /input/ (and not /raw/) was that doing so makes it more transparent what source files are used for analysis (since often times, /raw/ contains files that aren't used).

  2. The current procedure for saving external outputs requires manually uploading the output externally and saving the appropriate documentation. This seems like it could lead to room for error. One thought was to write an optional utility function that automatically uploads the output and saves the appropriate documentation.

  3. I ended up using the Python API interface for Jupyter notebook since it's more robust when it comes to Python version compatibility. The API lets you run notebooks created with Python 2 using Python 3 and vice-versa.

  4. From the testing I did, gslab_make should play nicely with Anaconda and run on different IDEs. One quibble is that Anaconda creates environments where the PATH variable may be different from your system environment. For example, Matlab is added to my PATH for my system environment but not my Anaconda environment, which means run_matlab works when running it via shell but not through Spyder. This is an idiosyncratic issue though, but something I can note in the RA manual.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Great.

On (1), I understand that reasoning. I think the rule we want to stick to is just that code can never reference a directory outside its own module.

I continue to go back and forth between these two structures we've discussed before

/data/ with subdirectory /raw/
/analysis/
etc.

or

/raw/
/data/
/analysis/
etc.

The former would correspond to a case where /raw/ is consumed only by code within /data/ and then (most) everything downstream consumes /data/output/ but not /data/raw/. The latter would correspond to a case where many different modules read from /raw/.

An annoying detail is what to do about things like image files we want to use only for slides. Current plan would be those go in /paper_slides/raw/. In the second scheme, are we OK having /raw/ both as a standalone module and as a sub within other modules? Or do we want to force those image files to all be in the top-level /raw/ directory.

Let me know your views.

On (2), I like the idea of a utility script but let's sequence that later after we finish tidying up and do the next round of review. For now let's just make sure the manual instructions are clear.

Good on (3)-(4).

One note: on the hard-committed input files for paper_slides what we currently have doesn't work. I still can't compile paper.lyx without running make.py first. I think the input files were committed as symbolic links rather than actual files.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Also, this looks like a typo in setup_repository.py...

$ python setup_repository.py
Traceback (most recent call last):
  File "setup_repository.py", line 83, in <module>
    configuration()
  File "setup_repository.py", line 78, in configuration
    check_software(config, config_user)
  File "setup_repository.py", line 59, in check_software
    check_executable(default_executabless[os.name]['git-lfs'])
NameError: global name 'default_executabless' is not defined

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Can we rename setup_repository.py -> check_setup.py?

That seems more accurate if the script is not doing any actual setup just checking if it's correct.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

I get the following error even when I run make.py in /analysis/, where there are no files >0.5mb.

WARNING! Certain files tracked by git exceed config limit (0.5 MB). See logs for more detail.

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

My mistake, I didn't push a bugfix. I fixed the typo and committed the hard files in paper_slides.

The size check function is currently repository-wide, but I can make it only walk through the module if that's what we want.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Ah, I see. Yes, I think it would make sense to have the size check function be module-specific. It would also be great if it could list the offending files.

Do you know why I would have been getting that warning on a clean checkout of the template? Presumably it should be set up so all large files are tracked w/ LFS.

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

The list of offending files are saved in make.log (I can reword my warning message to make that clearer). I didn't output the file list into the console because it seemed like it could get annoying if you intentionally wanted to keep those files.

The data-generating example script is saving TXTs instead of a CSVs, which we don't have added to LFS. It looks like gslab_make also has a tar file for testing, which I can start adding to LFS.

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

I personally like the idea of a centralized /raw/ folder with a /general/ subdirectory for any files that are used across multiple repositories and /module/ subdirectories for files that are only used in a specific module. There's a small cost to having to create input links/copies in all of the modules, but I think a single place for all of the data is useful for cases where we may want to release the code for a repository but not any of the data.

I think either way, we do want a clear distinction between files that are used only for a specific module vs. files that are used across multiple modules. That way, the dependencies are clearly outlined and it's much less likely that someone forgets to run all of the modules that rely on a file.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Great. Thanks.

On the large files, saving in make.log is fine. Let's update the warning message to point the user to make.log for the list. Let's switch the data-generating example so it saves .csv rather than .txt and add .tar to LFS. (We might want to make a note to clean up the previously committed versions of these at some point so the repository isn't heavier-weight than necessary; will matter since this repo will be a base for all other repos.)

On raw, let's go to a centralized /raw/ directory then. I don't think we need separate /module/ and /general/ subs -- if all other modules call from /raw/ using inputs.txt the record of which files are used where will be clear.

There is a general question of how we look at dependencies downstream. inputs.txt makes it easy to ask which other files F module M depends on. But it is hard to ask which modules M depend on some arbitrary file F. In the past we'd written a repo mapping tool that would crawl the inputs.txt and externals.txt files and make a nice network map of the relationships. Once we finish the first round of review we can think about creating an updated version of that.

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

I have addressed the comments here and here and have updated all of the documentation accordingly.

I can put (i) cleaning the commit history (the .git folder for a clean checkout is currently about ~4 MB, so not egregious yet), (ii) adding a utility function for uploading external files, and (iii) adding a utility function for mapping dependencies on the long-term todo list.

from template_archive.

gentzkow avatar gentzkow commented on September 2, 2024

Great! I think we're basically done and ready to go to review.

The only small thing I'd suggest is that we store some actual raw data in /raw/ that is then consumed by /data/ (rather than having the code in /data/ generate the data itself).

We might as well make the example code a little more live by having a raw data set in /raw/ then doing some transformations of it in /data/.

When we send for review we should make clear to everyone that it's next on the to do list to replace all of this with a more interesting / rich set of example scripts.

Oh, and also: I think I'm going to backtrack on my suggestion to have the .ipynb code as part of the example code in the main repo. It's great that you wrote the run_jupyter function and we definitely will want to put a jupyter notebook directory as one of the add-on examples we include but I think it will be simpler if all the examples are plain vanilla python.

from template_archive.

z-y-huang avatar z-y-huang commented on September 2, 2024

@gentzkow:

I added a dataset in /raw/ and removed the notebook in /analysis/. Closing this issue and will open a new one for review.

from template_archive.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.