timm / ezr

Explanation system for semi-supervised multi-objective optimization

License: BSD 2-Clause "Simplified" License

Makefile 1.42% Lua 56.90% Awk 11.46% Python 27.13% CSS 0.35% Shell 2.19% HTML 0.55%

active-learning ai easier explanation multi-objective optimization python refactoring-exercise semi-supervised-learning teaching

ezr's Introduction

Easier AI (just the important bits)

For over two decades, I have been mentoring people about SE and AI. When you do that, after a while, you realize:

When it is all said and done, you only need a dozen or so cool tricks;
Other people really only need a few dozen or so bits of AI theory;
Everyone could have more fun, and get more done, if we avoided the same dozen or so traps.

So I decided to write down that theory and those tricks and traps (see below). I took some XAI code (explainable AI) I'd written for semi-supervised multiple-objective optimization. Then I wrote notes on any part of the code where I had spent time helping people with those tricks, theory and traps.

Here is how the notes are labelled. For way-out ideas, read the 500+ ones. For good-old-fashioned command-line warrior stuff, see 100-200.

Odd-numbered items are about SE;
Even-numbered items are about AI.

Anti-patterns (things not to do)	SE system	SE coding	AI coding	AI theory (standard)	New AI ideas
00 - 99	100 - 199	200 - 299	300 - 399	400 - 499	500 - 599

One more thing. The SE and AI literature is full of bold experiments that try a range of new ideas. But some new ideas are better than others. With a little time, and lots of implementation experience, we can focus on which ideas offer the "most bang per buck".

Share and enjoy.

Setting Up

Get some example data

Installation

First get some test data:

git clone http://github.com/timm/data

Just grab the code:

git clone http://github.com/timm/ezr
cd ezr/src
python3 -B ezr.py -t path2data/misc/auto93.csv -e all

Or install from local code (if you edit the code, those changes are instantly accessible):

git clone http://github.com/timm/ezr
cd ezr
pip install -e .   # drop -e for a regular (non-editable) install
ezr -t path2data/misc/auto93.csv -e all # test the install

Or install from the web. Best if you just want to import the code, then write your own extensions:

pip install ezr
ezr -t path2data/misc/auto93.csv -e all # test the install

Running the code

This code has lots of eg.xxx() functions. Each of these can be called on the command line using, say:

 python3 -B ezr.py -e klass      # calls the eg.klass() function
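
To illustrate the idea of mapping a `-e name` flag onto a function, here is a minimal sketch. It is not the actual ezr dispatcher: the `eg` class below, its two demo functions, and the `run` helper are all hypothetical names invented for this example, and special cases such as `-e all` are not handled.

```python
class eg:
    """Hypothetical container of demo functions (not ezr's real ones)."""

    @staticmethod
    def klass():
        return "running klass demo"

    @staticmethod
    def hello():
        return "running hello demo"


def run(argv):
    """Look for '-e NAME' in argv and call eg.NAME(); return its result, else None."""
    for flag, name in zip(argv, argv[1:]):
        if flag == "-e" and not name.startswith("_"):
            fun = getattr(eg, name, None)   # None if no such demo exists
            if callable(fun):
                return fun()
    return None


print(run(["-e", "klass"]))   # calls eg.klass()
```

In a real script, `run(sys.argv[1:])` would be the entry point. Looking functions up by name keeps the command line and the list of demos in sync without a hand-written table.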

ezr's People

Contributors

Stargazers

Watchers

Forkers

andre-motta sairajzero 2samferguson kkganguly lohithsowmiyan sid1238 sathiya06 ferguson19sam amiiralii

ezr's Issues

stats

sway

need a second run through on ranges to merge the low value ones.

quick tut on kmeans

should be 6 lines

use same hook thing as naive bayes. if done incrementally, could run $k \in (2,3,4,8)$ and kernel $\in (trig,uniform)$ all at the same time
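
As a starting point for that tutorial, here is a compact batch k-means sketch. It is longer than the six lines this note asks for, and it does not cover the incremental version, the naive bayes hook, or the kernels mentioned above. The names `dist2` and `kmeans`, the fixed iteration count, and the seed are all assumptions made for illustration.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k=2, iters=10, seed=1):
    """Plain batch k-means: returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)          # start from k random rows
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for row in rows:                     # assign each row to its nearest centroid
            nearest = min(range(k), key=lambda i: dist2(row, centroids[i]))
            clusters[nearest].append(row)
        centroids = [                        # move each centroid to its cluster mean
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


rows = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(rows, k=2)
print(sorted(len(c) for c in clusters))
```

On this toy data the two well-separated groups of three rows end up in separate clusters.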

need a text tool to handle some simple NLP pre-processing

need a core idea doc

surfing the long tail

where there is little data

compression is intelligence

1GB picture of a straight line can be condensed to the m,b of y=mx+b
better yet, condense to two end points
- now we have an anomaly detector (anything off the line between them, anything away from our two poles)
- now we have runtime certification: summarize the training data, complain when runtime data falls outside the space of things seen during training
- and now we have a compression algorithm (anything new that ain't an anomaly can be ignored)
- and now we have on-line learning: if there are anomalies, recluster that region of the data

of course, in practice, we'll need more than 2 points. care to guess how many? often less than 100 (to map out 50 lines)
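
The two-end-point idea can be sketched as follows. This is an illustration only: the function names, the 2-D setting, and the tolerance `tol=0.5` are assumptions, not code from ezr. A point counts as an anomaly if it is too far from the line segment between the two poles, which covers both "off the line" and "beyond the poles".

```python
import math


def dist_to_segment(p, a, b):
    """Distance from 2-D point p to the segment between poles a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    if length2 == 0:                       # degenerate case: both poles coincide
        return math.hypot(px - ax, py - ay)
    # t is where p projects onto the line; clamping to [0, 1] means points
    # past either pole are measured from that pole.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_anomaly(p, a, b, tol=0.5):
    """True if p lies further than tol from the segment a..b (tol is an assumed threshold)."""
    return dist_to_segment(p, a, b) > tol


a, b = (0.0, 1.0), (10.0, 21.0)            # two end points of y = 2x + 1
print(is_anomaly((5.0, 11.0), a, b))       # on the line, between the poles
print(is_anomaly((5.0, 15.0), a, b))       # off the line
print(is_anomaly((20.0, 41.0), a, b))      # on the line's extension, beyond the poles
```

Only the two poles are stored, yet they are enough to flag runtime data that falls outside what was seen in training.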

less is more

not the best thing
but things statistically indistinguishable from the best
e.g.
- $N(\mu=0, \sigma=1)$ effectively runs from -3 to 3.
- Cohen's rule says anything closer than $0.35*\sigma$ differs by a small effect or less
- $0.35/(3 - (-3)) \approx 6$%, so there are only about 17 statistically distinguishable solutions
- according to Hamlet, the number of random samples needed to be 95% certain of finding something with
  p=0.06 is
  - $n(C=0.95, p=0.06) = \log(1-C)/\log(1-p) \approx 49$
And if we had some smart heuristic for sorting better from worse, we could apply $\log_2$ to the above.
- so, with some smarts, we can explore the world with $\log_2(49)\approx 6$ samples.
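
The arithmetic above is easy to check, and the result is sensitive to the value of p. The sketch below evaluates the sample-size formula at p=0.05 and at p=0.06 (the latter is closer to 0.35/6). The function name `samples_needed` is an assumption made for this example.

```python
import math


def samples_needed(confidence, p):
    """n = log(1 - C) / log(1 - p): random draws needed to be `confidence`
    sure of seeing at least one item from a region of probability p."""
    return math.log(1 - confidence) / math.log(1 - p)


effect = 0.35 / (3 - (-3))     # small effect as a fraction of the -3..3 range
print(round(effect, 3))

for p in (0.05, 0.06):
    n = samples_needed(0.95, p)
    # p, raw n, n rounded up, and log2(n) for the "smart heuristic" case
    print(p, round(n, 1), math.ceil(n), round(math.log2(n), 1))
```

With p=0.05 the formula gives about 58 samples; with p=0.06 it gives about 48.4, or 49 after rounding up. In both cases $\log_2(n)$ is close to 6.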

need a scikit-learn tool to turn our data into skl format.

import rahul's similarity code

https://gist.github.com/yrahul3910/553f255e4305a82b32da14bf23db805c

trees

finish the repo

in /readme.md, the scripts need to be listed
in /docs, please delete all those *.html. and if those image files are not used, they can go too
please delete /erz
please get rid of /*.html
inside src, is there crap that can be pruned?
please add pdf of the emse paper to /docs

er... what else?

look for more data sets

kewen's veer paper and vivek's bad learner paper list a bunch of SS-* files. have we got them in /data?

https://dl.acm.org/doi/10.1145/3106237.3106238
https://arxiv.org/abs/2106.02716

smo too slow

rest never needs to be rebuilt
only best.

needs stats working