data_science_delivered's People

Contributors

ianozsvald

data_science_delivered's Issues

Draft a recipe for running a research -> deployment project

What sort of plan can be proposed to lay out a successful project from idea through to deployment?

How to de-risk? Evaluate value, costs and risks. How to stage it? How to get buy-in? How to demonstrate progress? How to deploy to a non-DS team?

Regression diagnosis notes

  • Add Gradient Boosted Tree Partial Dependence plot - does it agree with my 2 variable exploration with RF?
  • Add more cumulative plots
  • Reorder so the experimental stuff is at the end, the start is about explaining predictions
  • Quantile regression intervals (on RF or GBT?) using fully expanded trees
  • Using LIME and a set of test cases, show which features are most important over a regression range - do we see a pattern?
  • Consider adding Shapley values (e.g. SHAP)
  • Confusion matrix (using binned data)?
  • Can I extract LIME features to explain what's in the model?
  • Consider Breiman's RF suggestion of randomizing each key feature in turn in the test set, then observing the overall change in the score - which variables are critical? How well does this mirror feature-importance? -> recently added to ELI5 https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
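
The permutation idea in the last bullet can be sketched without any library. This is a minimal illustration only, not the ELI5 implementation: `toy_predict` is a hypothetical stand-in for a fitted model, the data is synthetic, and mean squared error is an assumed choice of score. Each feature column of the test set is shuffled in turn and the increase in error is recorded.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Return, per feature, the mean increase in MSE after shuffling that column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column only; every other column keeps its original values.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical "model": leans heavily on feature 0, weakly on feature 1, ignores feature 2.
def toy_predict(row):
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X_test = [[data_rng.uniform(0, 10) for _ in range(3)] for _ in range(200)]
y_test = [toy_predict(row) + data_rng.gauss(0, 0.1) for row in X_test]

importances = permutation_importance(toy_predict, X_test, y_test)
for i, imp in enumerate(importances):
    print(f"feature {i}: MSE increase {imp:.3f}")
```

Comparing this ranking against a model's built-in feature importances is one way to approach the "how well does this mirror feature-importance?" question.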

notes on stuff I should add

  • if you lack constraints on datastores then duplicates will occur
  • how to create setup.py
  • Hypothesis can fuzz MySQL to check that the data going in and coming back out is the same
  • assume during data ingestion that you'll have duplication/redundancy - how to spot and remove it?
  • starting point for data ingestion - assume this is a sequence of processes that build on each other, not a single process with all the steps done at once. This way you can swap things in, test in isolation and scale to more machines
  • list some text similarity metrics (fuzzywuzzy, Levenshtein); note the choice between char-based, word-based and char n-gram similarity; maybe removing punctuation/case/unicode is useful?
  • pandas read_csv defaults to dayfirst=False; consider dayfirst=True for poorly specified European (day-first) dates
  • consider linking to http://datapatterns.org/pattern/
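
A rough sketch of the duplicate-spotting and text-similarity notes above, using only the standard library (it does not use fuzzywuzzy). The record names and the normalisation choices (strip accents, case and punctuation) are assumptions for illustration; real data may need different rules.

```python
import string
import unicodedata


def normalise(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def ngram_jaccard(a, b, n=3):
    """Jaccard similarity of the two strings' character n-gram sets."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical records; the first two differ only in accents, case and punctuation.
records = ["Café Müller Ltd.", "cafe muller ltd", "Cafe Mueller Limited", "Acme Widgets"]

# Exact duplicates after normalisation: group the raw strings by their normalised key.
groups = {}
for record in records:
    groups.setdefault(normalise(record), []).append(record)
duplicates = {key: raw for key, raw in groups.items() if len(raw) > 1}
print(duplicates)

# Near-duplicates need a similarity score and a threshold chosen for the data.
keys = sorted(groups)
for i, left in enumerate(keys):
    for right in keys[i + 1:]:
        print(left, "|", right, "|", levenshtein(left, right),
              round(ngram_jaccard(left, right), 2))
```

Normalising first catches the cheap cases exactly; the edit distance and n-gram scores are then only needed for the remaining near-matches.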

learning strategies

  • more clean data (probably) beats smarter algorithms

clustering for EDA

cleaning

process

  • list project-types that might work and why, @springcoil talks on the requirement to invest in tooling to deliver working systems
  • r&d != engineering
  • how might r&d (e.g. 1 person) interface with an eng team?
  • which bits of an agile process seem to work well? do sprints work well (depends on the task-type)?
  • who 'owns' the data/process, can that cause problems?
  • does the lack of a shared language hinder things?
  • data scientists need clean data, but the system will probably always have some dirty data; there is a need for a data-cleaning process (data eng team?) that improves data quality to an agreed schema and can export/transform the data for use by the r&d team
  • building mini-monolithic-blocks is normal; remember to break them up into smaller services that can be tested, otherwise critical testing is easily skipped (costing development speed later)
  • add logging early for anything production-like
  • luigi for task pipelines to avoid manual steps
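
One way to picture the last three points is a pipeline built from small stages that log what they do and can each be tested alone. This is a plain-Python sketch under assumptions: the stage names, the two-column input format and the sample data are hypothetical, and it does not use Luigi (which would add task dependencies and scheduling on top of this shape).

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("ingest")


def parse_rows(lines):
    """Stage 1: split raw 'name, value' lines into dict records."""
    rows = []
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2:
            log.warning("skipping malformed line: %r", line)
            continue
        rows.append({"name": parts[0], "value": parts[1]})
    log.info("parsed %d of %d lines", len(rows), len(lines))
    return rows


def drop_duplicates(rows):
    """Stage 2: keep the first occurrence of each (lower-cased name, value) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["name"].lower(), row["value"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    log.info("dropped %d duplicate rows", len(rows) - len(unique))
    return unique


def to_numeric(rows):
    """Stage 3: convert values to float, logging and dropping rows that fail."""
    converted = []
    for row in rows:
        try:
            converted.append({"name": row["name"], "value": float(row["value"])})
        except ValueError:
            log.warning("non-numeric value for %s: %r", row["name"], row["value"])
    return converted


def run_pipeline(data, stages):
    """Feed the output of each stage into the next."""
    for stage in stages:
        data = stage(data)
    return data


raw = ["alice, 1.5", "ALICE, 1.5", "bob, n/a", "broken line", "carol, 3"]
result = run_pipeline(raw, [parse_rows, drop_duplicates, to_numeric])
print(result)
```

Because each stage is an ordinary function, a stage can be swapped out or unit-tested without running the rest of the pipeline.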

getting hired:

  • what you need to show if you want to get hired (github, talks)
  • minimal stuff you should do to be more visible

list of tools I'd like to see

further reading

pipeline building

tools on my radar

review:
