science-data-science's Introduction

A manifesto for agile (data) science

The parentheses indicate that pretty much all science is data science, since managing data is essential and integral to the work of scientists.

To fix the reproducibility and replicability crisis of modern science, and reduce the anxiety and bad career prospects of researchers, science needs to follow the path of software development and become agile.

Here are a paper and a presentation that justify a set of new hypotheses intended to improve the situation of science. They are open to suggestions; just open a pull request.

Hypotheses

Reproducibility over publishability
Testing at all levels over hypotheses proved once
Open over closed
Stakeholder collaboration over vertical chains-of-command

Sign the manifesto

If you agree with these principles, add yourself to the list of adherents via a pull request.


science-data-science's Issues

Update the state of the art in agile data science

Update to 2022 with some bibliography

Add some fixes

Write a possible path forward

Add an abstract

Write a small report with the essential line of argument

Write an introduction with the line of argument of the presentation

Open to collaborations?

Hi @JJ, congrats on putting together this excellent manifesto. It really resonated with my own story and with how I see the future of science. I'm wondering if you'd be open to contributions. Let me first give you a bit of my background and explain how I experienced first-hand how Agile could help science.

I started my PhD in 2012 (it was indeed in the ML area). Around that time my research group was using SVN (this was considered advanced even in Engineering faculties) and collaborating on papers by keeping the folders with the LaTeX source in Dropbox. A few of us who were familiar with the growth of the Agile mindset started suggesting git and kanban boards, and by the time I finished my PhD in 2015 we were using Docker to ship the code that ran our experiments and even thinking about publishing not just the code and the datasets, but the Docker images themselves. We never did, because the baby steps in the community, as you mentioned with NeurIPS (NIPS back then), were just to provide a link to the repo and to open-source your dataset if you had a new one.

Once some of the conditions of the experiment are reproducible (the code with the method, and the environment), what about the data? This was a massive headache for us, as results are sometimes generated using specific train, validation and test splits and, even worse, data sometimes evolves with time (e.g. weather, astronomical observations, etc.). I finished my PhD at that point, and because many of my experiments were in computer vision, just providing a static version of the train, validation and test sets was enough for data reproducibility.

But then I jumped to industry, where the explosion of tooling and the benefits of the Agile mindset in data science teams were evident. Some of the practices (like sprints) were still hard to adopt, as there is still a research aspect that is almost impossible to timebox. But extensive testing, automation, frequent standups to untangle blockers, letting the MVPs drive the pace, etc. are all practices that academic science was lacking and that could have a massive impact on it.

Finally, I think adopting a more open and agile mindset in academia is crucial to bringing in other sources of funding for fundamental research. If academia stays paper-centred (worse if the paper is in a closed-door journal with a 1.5-year decision turnaround), aspiring and junior researchers will see that the real innovation is happening outside universities and research centres, in an iterative, fail-fast way in industry, the way this has traditionally happened in medicine.

Apologies for the long post, but I wanted to give you a bit of background and ask whether I could contribute a few humble comments and fixes for typos I've noticed. I really hope this can get some traction via arXiv or some other tool.
