
science-data-science's Introduction

A manifesto for agile (data) science

The parentheses indicate that pretty much all science is data science, since managing data is essential and integral to the work of scientists.

To fix the reproducibility and replicability crisis of modern science, and to ease the anxiety and poor career prospects researchers face, science needs to follow the path of software development and become agile.

Here are a paper and a presentation that justify a set of new hypotheses for improving the situation of science. They're open to suggestions; just open a pull request.

Hypotheses

  1. Reproducibility over publishability
  2. Testing at all levels over hypotheses proved once
  3. Open over closed
  4. Stakeholder collaboration over vertical chains-of-command

Sign the manifesto

If you agree with these principles, add yourself to the list of adherents via a pull request.

science-data-science's People

Contributors

jj

Watchers

Alberto Guillén

Forkers

jmrr mariosky

science-data-science's Issues

Open to collaborations?

Hi @JJ, congrats on putting together this excellent manifesto. It really resonated with my own story and with how I see the future of science. I'm wondering if you'd be open to contributions. Let me first give you a bit of my background and of how I experienced first-hand that Agile could help science.

I started my PhD in 2012 (in the ML area, indeed). Around that time my research group was using SVN (considered advanced even in Engineering faculties) and collaborating on papers by keeping the folders with the LaTeX source in Dropbox. A few of us who were familiar with the growing Agile mindset started suggesting git and kanban boards, and by the time I finished my PhD in 2015 we were using Docker to ship the code that ran our experiments, and even thinking about publishing not just the code and the datasets but the Docker images themselves. We never did, because the baby steps in the community, as you mentioned with NeurIPS (NIPS back then), were just to provide a link to the repo and to open-source your dataset if you had a new one.

Given that some of the conditions of an experiment (the code with the method, and the environment) can be made reproducible, what about the data? This was a massive headache for us: sometimes the results are generated using specific train, validation and test sets and, even worse, sometimes the data evolves over time (e.g. weather, astronomical observations, etc.). At that point I finished my PhD, and because many of my experiments were in computer vision, providing a static version of the train+validation+test sets was enough for data reproducibility.
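To make the data side concrete, here is a minimal sketch of one way the splits could be pinned. It is an illustration only, not what we actually used: the helper names `assign_split` and `build_manifest`, the 80/10/10 proportions and the idea of identifying samples by file name are all assumptions. Each sample is assigned to a split from a hash of its ID, so the assignment does not depend on a random seed or on file ordering, and a single digest over the manifest tells you whether the data has changed since the results were produced.

```python
import hashlib
import json
from typing import Dict


def assign_split(sample_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a sample ID to a split using a hash of the ID (hypothetical helper).

    The same ID always lands in the same split, on any machine and in any run.
    """
    bucket = int(hashlib.sha256(sample_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def build_manifest(samples: Dict[str, bytes]) -> dict:
    """Record the split and content hash of every sample (hypothetical helper).

    `samples` maps a sample ID (assumed here to be a file name) to its raw bytes.
    The top-level digest changes if any sample is added, removed or modified.
    """
    entries = {
        sample_id: {
            "split": assign_split(sample_id),
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        for sample_id, content in sorted(samples.items())
    }
    serialized = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"entries": entries, "digest": hashlib.sha256(serialized).hexdigest()}


if __name__ == "__main__":
    # Toy in-memory dataset; real code would read the bytes from disk.
    toy = {"img_001.png": b"first image", "img_002.png": b"second image"}
    manifest = build_manifest(toy)
    print(manifest["digest"])
```

For data that evolves with time, the same idea would presumably apply per snapshot: publish the manifest digest next to the results, so that anyone re-running the experiment can check they are using the same snapshot.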

But then I jumped to industry, where the explosion of tooling and the benefits of the Agile mindset in data science teams were evident. Some practices (like sprints) were still hard to adopt, as there is still a research aspect that is almost impossible to timebox. But extensive testing, automation, frequent standups to untangle blockers, letting the MVPs drive the pace, etc. are all practices that academic science was lacking and that could have a massive impact on it.

Finally, I think adopting a more open and agile mindset in academia is crucial to bringing in other sources of funding for fundamental research. If academia stays paper-centred (worse still if the paper goes to a closed-door journal with a 1.5-year decision turnaround), aspiring and junior researchers will see that the real innovation is happening outside universities and research centres, in an iterative, fail-fast way in industry, the way this has traditionally happened in Medicine.

Apologies for the long post, but I wanted to give you a bit of background and ask whether I could contribute a few humble comments and fixes for typos I've noticed. I really hope this can get some traction via arXiv or some other tool.
