Arnaud Martin
I know you have probably read too many papers already for a Christmas vacation, so don't hesitate to skip to the parts summary, evaluate, improve and argument. They will provide you with all you need to grade this work. The paper in itself
[TOC]
I chose to work with the image dataset colorectal-histology. The dataset was released with a paper (Kather, 2016) [1]. It is composed of 5,000 images of colon tissue divided into eight classes. The classes are:
- tumor epithelium
- simple stroma
- complex stroma
- immune cells
- debris and mucus
- mucosal gland
- adipose tissue
- background
The authors converted the images to greyscale and developed a set of problem-specific features, divided into 6 broad categories that we won't detail here. They then used four classifiers: 1-NN, an ensemble of trees, linSVM and rbfSVM. They defined two types of problems. The first is the multiclass problem, where they try to predict the label of the image. The second is the conventional problem, where they try to predict whether the patient has cancer or not.
Multiplying feature sets, classifiers and problems (6 × 4 × 2), there are 48 models, which the authors compare using error rates. Error rates vary from 4% to 50%.
The point of the project is to showcase how to improve (& build) a predictive model using these images.
In my non-expert understanding, stroma (simple or complex) [2], immune cells [3], debris and mucus [4], mucosal glands [5], adipose tissue [6] and background are not conditions. The only class that requires special attention is the first one, tumor epithelium [7].
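This reading of the classes amounts to collapsing the eight labels into two. A minimal sketch of that mapping, assuming the labels are integers that follow the order of the list above (0 = tumor epithelium); the actual encoding in the dataset may differ, and `to_binary` is a hypothetical helper name:

```python
# Class names in the order this report lists them. The integer encoding
# (0 = tumor epithelium) is an assumption made for the example.
CLASS_NAMES = [
    "tumor epithelium",
    "simple stroma",
    "complex stroma",
    "immune cells",
    "debris and mucus",
    "mucosal gland",
    "adipose tissue",
    "background",
]
TUMOR_CLASS = CLASS_NAMES.index("tumor epithelium")


def to_binary(labels):
    """Map eight-class labels to 1 (tumor) or 0 (any other class)."""
    return [1 if label == TUMOR_CLASS else 0 for label in labels]


print(to_binary([0, 3, 7, 0, 5]))  # [1, 0, 0, 1, 0]
```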
I want to build (a little bit) on the authors' work. I assume that in a real scenario with such an application, a doctor will confirm any positive. So I think error rate is not the most important metric for the patient. I will therefore try to make a model that prioritizes not missing a tumor, even at the expense of predicting other classes as tumor.
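One way to picture this trade-off is through the decision threshold applied to the model's output. The sketch below assumes the model returns a tumor score between 0 and 1; the labels, scores and function names are invented for illustration and are not results from this project.

```python
def recall_at(threshold, y_true, scores):
    """Fraction of true tumors whose score reaches the threshold."""
    positives = sum(y_true)
    hits = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return hits / positives if positives else 0.0


def false_positives_at(threshold, y_true, scores):
    """Number of non-tumor samples flagged as tumor at this threshold."""
    return sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)


def pick_threshold(y_true, scores, target_recall):
    """Return the highest observed score that still meets the recall target."""
    for candidate in sorted(set(scores), reverse=True):
        if recall_at(candidate, y_true, scores) >= target_recall:
            return candidate
    return None  # no threshold reaches the target (e.g. no positives at all)


# Toy data: 3 tumor samples, 5 non-tumor samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

default_fp = false_positives_at(0.5, y_true, scores)   # 1 false positive
best = pick_threshold(y_true, scores, target_recall=1.0)
print(best, false_positives_at(best, y_true, scores))  # 0.3 2
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches the last tumor at the cost of one more false positive, which is the exchange described above.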
My plan is to first build a model that the authors did not explore: a CNN. Then I will compare its error rate to the authors' models. Then, depending on the time I have, I will either try to improve it and compare its recall to my first model, or try a model made by the authors and compare its recall to my model.
I chose a CNN as the baseline model because it was the one I could implement the fastest. At first I chose not to do any preprocessing: I had no idea whether it would have any impact on the work, and since I did not have a baseline model yet, I would not have been able to measure the impact of the preprocessing.
- I decided to focus on the tumor/not tumor problem because it is the most important to a hypothetical patient.
- I used recall as a metric. Ethically, a false negative could mean a patient's death, while a false positive would likely be caught by another doctor later in the care process. The false negative is the enemy in this problem.
- I used error rate as a metric because it is the metric used by the authors of the original dataset, and I compared my models to theirs. (I actually computed accuracy, with error rate = 1 - accuracy.)
- Just to be nerdy and try it out, I made some ROC curves.
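
The two main metrics above can be written down in a few lines. This is a minimal sketch on invented labels, with 1 meaning tumor and 0 meaning not tumor; the function names are illustrative and not the code used in the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means tumor."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def recall(y_true, y_pred):
    """Share of actual tumors that were predicted as tumor: tp / (tp + fn)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def error_rate(y_true, y_pred):
    """Share of wrong predictions, i.e. 1 - accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("error rate is undefined for an empty sample")
    return (fp + fn) / total


# Toy sample: 2 tumors out of 8, and a model that always answers "not tumor".
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_healthy = [0] * len(y_true)

print(recall(y_true, always_healthy))      # 0.0
print(error_rate(y_true, always_healthy))  # 0.25
```

The toy case shows why error rate alone can be misleading on an imbalanced sample: a model that never predicts tumor is wrong only 25% of the time here, yet its recall is 0.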
| Model version | Argument | Metrics | Explanation |
|---|---|---|---|
| Authors' | Not mine | Recall: ???<br>ER: 4.3~21.7% | I think recall would be a more interesting metric. |
| Baseline | - Made fast (in development time)<br>- No preprocessing<br>- CNN from flower TP<br>- 5 epochs | Recall: 0<br>ER: 23% | The model always predicts healthy, probably because most examples are healthy. |
| v1 (data augmentation) | - Same as baseline<br>- Data augmentation of tumor examples up to a 50/50 healthy/unhealthy ratio (rotation only, so as not to trim the tumor: I have no medical knowledge) | Recall: 0.9%<br>ER: 99.1% (0.9 = 9 samples) | - Recall is not 0, but I expected at least 50%<br>- Model output is always around 0.45<br>- To me it seems the model is mostly random. Maybe the task is too complex for the model<br>- When the model says positive, the person is really positive |
| v2 | - Same as v1<br>- Simplified the input by going greyscale<br>- Made the model more complex by adding depth and width | ER: 100%<br>Recall: 0 | - At this point I think I have a bug<br>- Maybe the shuffle doesn't ... |
| v3 | - Implemented train_test_split and shuffle manually | ER: 43%<br>Recall: 0 | I notice accuracy does not update through epochs and find it could be due to a bad learning rate selection [8] |
| v4 | - I searched for a good learning rate using grid search | ER: ~45%<br>Recall: 0 | For all tested learning rates (1 to 0.001), recall always stayed at 0. |
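
Two of the steps in the table, balancing the classes with rotated tumor images (v1) and a manual shuffle and train/test split (v3), can be sketched as below. This is an illustration under stated assumptions rather than the project's code: images are small nested lists instead of real image arrays, only 90-degree rotations are used, and `rotate90`, `balance_by_rotation` and `shuffled_split` are hypothetical names.

```python
import random


def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def balance_by_rotation(images, labels, positive=1):
    """Append rotated copies of positive images until both classes match.

    Only 90/180/270 degree turns are used, so copies start repeating once
    every positive image has been rotated three times.
    """
    out_images, out_labels = list(images), list(labels)
    positives = [img for img, lab in zip(images, labels) if lab == positive]
    if not positives:
        return out_images, out_labels
    deficit = (len(labels) - len(positives)) - len(positives)
    for i in range(max(deficit, 0)):
        rotated = positives[i % len(positives)]
        turns = 1 + (i // len(positives)) % 3
        for _ in range(turns):
            rotated = rotate90(rotated)
        out_images.append(rotated)
        out_labels.append(positive)
    return out_images, out_labels


def shuffled_split(images, labels, test_fraction=0.2, seed=0):
    """Shuffle indices once, then cut them into train and test parts."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [images[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


# Toy data: ten 2x2 "images", two of them labelled tumor (1).
images = [[[k, k + 1], [k + 2, k + 3]] for k in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

(train_x, train_y), (test_x, test_y) = shuffled_split(images, labels)
train_x, train_y = balance_by_rotation(train_x, train_y)
```

In this sketch the split happens before the augmentation, so a rotated copy of a test image cannot end up in the training set.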
## Bibliography & Sources
[1] - Kather, J., Weis, CA., Bianconi, F. et al. Multi-class texture analysis in colorectal cancer histology. Sci Rep 6, 27988 (2016). https://doi.org/10.1038/srep27988
[2] - https://en.wikipedia.org/wiki/Stroma_(tissue)
[3] - https://en.wikipedia.org/wiki/Nonspecific_immune_cell
[4] - https://www.frontiersin.org/articles/10.3389/fcimb.2020.00248/full
[5] - https://en.wikipedia.org/wiki/Mucous_gland