
kaggle_ndsb2's Introduction

This is the source code for the 3rd place solution to the Second National Data Science Bowl hosted by Kaggle.com. For documentation about the approach, look here.

Dependencies & data

I used the Anaconda default distribution with all the libraries that come with it. In addition I used OpenCV (cv2), pydicom, and MXNet (build 20151228, but later versions will most probably work). For more detailed Windows 64-bit installation instructions, look here.
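A quick way to check that the libraries are importable (a minimal sketch; note that pydicom versions from that era were imported as "dicom"):

import cv2
import dicom      # pydicom < 1.0 exposed the top-level module name "dicom"
import mxnet as mx

print("OpenCV version:", cv2.__version__)
print("MXNet loaded from:", mx.__file__)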

The DICOM data needs to be downloaded from Kaggle and extracted into the data_kaggle/train, data_kaggle/validate, and data_kaggle/test folders.
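A small sketch to verify that layout before running the pipeline (paths as described above):

import os

# Check that the three expected Kaggle data folders are present.
for sub in ("train", "validate", "test"):
    path = os.path.join("data_kaggle", sub)
    print(path, "OK" if os.path.isdir(path) else "MISSING")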

Adjust settings

In settings.py you can adjust some parameters. The most important one is the special "quick mode", which makes training the model roughly 5x faster at the expense of some data-science rigor: instead of training different folds to calibrate on (to prevent overfitting), we train only one fold. This overfits a bit in steps 3 and 4 but still results in a solid 0.0105 score, which is enough for 3rd place on the leaderboard. Without quick mode, training takes much longer but overfits less, giving 0.0101 on the leaderboard. That is almost 2nd place, and with some luck it might have been.
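For illustration, such a toggle might look like the following (variable names here are hypothetical, not necessarily those in the repository's settings.py):

# settings.py (hypothetical names, for illustration only)
QUICK_MODE = True   # True: train one fold, ~5x faster, slight overfit (~0.0105 LB)
                    # False: train all folds for calibration (~0.0101 LB)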

Run the solution

Run the five steps below in order; a minimal driver-script sketch follows the list.

  1. python step0_preprocess.py
    As a result, the /data_preprocessed_images folder will contain ~329,000 preprocessed images, and some extra CSV files will be generated in the root folder.
  2. python step1_train_segmenter.py
    As a result, you will have one or more trained models in the root folder. Depending on the fold, the RMSE should be around 0.049 (train) and 0.052 (validate).
  3. python step2_predict_volumes.py
    As a result, you will have a CSV file containing raw predictions for all 1,140 patients. The data_patient_predictions folder will also contain all generated overlays and per-patient CSV data for debugging. In the logs, the average error should be around 10 ml.
  4. python step3_calibrate.py
    As a result, you will have a CSV file containing all the calibrated predictions. In the logs, the average error should go down by roughly 1 ml.
  5. python step4_submission.py
    As a result, the /data_submission_files folder will contain a submission file. In the logs, the CRPS should be around 0.010.
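The sketch below chains the five steps and stops on the first failure, assuming the scripts are run from the repository root:

import subprocess
import sys

# Run the pipeline steps in order; check_call raises on a non-zero exit code.
steps = [
    "step0_preprocess.py",
    "step1_train_segmenter.py",
    "step2_predict_volumes.py",
    "step3_calibrate.py",
    "step4_submission.py",
]
for script in steps:
    print("Running", script)
    subprocess.check_call([sys.executable, script])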

Hardware

The solution should be gentle on the GPU because of the small batch size; any recent GPU supported by MXNet should do the job, I figure. The lowest-end card I tried that worked was a GT 740.


kaggle_ndsb2's Issues

Keras Implementation

Hi,

Do you know if there is a Keras implementation of this, and was there something missing in Keras that prompted you to use MXNet?

Rahul.V.

Determining the size of the segmentation

I was going through your code. How do we:

  1. Determine the area covered by the segmentation as a percentage of the complete image?
  2. Determine the size of the segmentations that are detected?

Your suggestions would be highly appreciated.
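For what it's worth, a minimal sketch of both computations, assuming the prediction is a 2-D binary mask and the DICOM pixel spacing (in mm) is available; the names here are illustrative, not from the repository:

import numpy as np

def segmentation_stats(mask, pixel_spacing):
    # mask: 2-D binary array (nonzero = segmented pixel)
    # pixel_spacing: (row_mm, col_mm), e.g. from the DICOM PixelSpacing tag
    seg_pixels = int(np.count_nonzero(mask))
    percent_covered = 100.0 * seg_pixels / mask.size
    area_mm2 = seg_pixels * pixel_spacing[0] * pixel_spacing[1]  # physical area
    return percent_covered, area_mm2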

step2_predict_volumes: errors when iterating an ndarray

Dear Julian,

Thank you for your valuable work and your practical code.

I was running step2_predict_volumes.py. At line 130:
predictions = pred_model.predict(pred_iter)

the code goes into models.py from the MXNet .egg file, and it raises an error which says:
"d:\chhong\mxnet\include\mxnet./ndarray.h:217: Check failed: (shape_[0]) >= (end) Slice end index out of range"

I read about this problem but I couldn't fix it.

Would you please help me?
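Not a confirmed diagnosis, but this particular MXNet error is often reported when the sample count is not an exact multiple of the iterator's batch size, so the last slice runs past the end of the array. One possible workaround is to pad the input up to a multiple of the batch size and drop the padded predictions afterwards; a sketch:

import numpy as np

def pad_to_batch_multiple(data, batch_size):
    # Pad along axis 0 (by repeating the last sample) so that
    # data.shape[0] becomes a multiple of batch_size.
    remainder = data.shape[0] % batch_size
    if remainder == 0:
        return data, data.shape[0]
    padding = np.repeat(data[-1:], batch_size - remainder, axis=0)
    return np.concatenate([data, padding], axis=0), data.shape[0]

# Illustrative usage (make_iter is a placeholder for building the iterator):
# padded, n_real = pad_to_batch_multiple(images, batch_size)
# predictions = pred_model.predict(make_iter(padded))[:n_real]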

Several questions regarding your solution

Hi Julian,

Thanks for sharing the code. I have several questions after reading your solution document.

According to the U-Net paper, the output map has size (row, column, 2), i.e., it has two feature channels, but it looks like you use only one channel. Is that right? Could you explain this in more detail?

You once mentioned that "segmentation nets are numerically unstable"; would you elaborate on this point? Are there any references discussing it?

You mentioned, "Note that I used relu activations and batch normalization after every convolution". Regarding "batch normalization": do you mean you add a batch-normalization layer after each convolution layer? If I remember correctly, I once heard that a normalization layer may not be needed if batch normalization is used in the optimization method. Specifically, I am not clear what you mean by "batch normalization after every convolution".
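For reference, the pattern usually looks like the following in MXNet's symbol API (a minimal sketch, not code copied from this repository):

import mxnet as mx

def conv_bn_relu(data, num_filter, kernel=(3, 3), pad=(1, 1)):
    # convolution, then a batch-normalization layer, then a relu activation
    net = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=kernel, pad=pad)
    net = mx.sym.BatchNorm(data=net)
    net = mx.sym.Activation(data=net, act_type="relu")
    return net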

How many epochs do you use for training?

Thanks for the help.

Dropout placement

Hey, thanks for the nice code and blog post!

I don't know if it's important, but it confused me a bit. In the blog post the first dropout layer is placed after all the downsampling layers, i.e. after pool5, but in the code I see that it is placed after pool4. I think it would be more consistent to put it after pool5, as in the post.

...
pool4 = convolution_module(net, kernel_size, pad_size, filter_count=filter_count * 4, down_pool=True)
net = pool4
net = mx.sym.Dropout(net)  # dropout is applied here, directly after pool4
pool5 = convolution_module(net, kernel_size, pad_size, filter_count=filter_count * 8, down_pool=True)
net = pool5
net = convolution_module(net, kernel_size, pad_size, filter_count=filter_count * 4, up_pool=True)
...
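For clarity, the rearrangement suggested above would look like this (same building blocks, with the dropout moved after pool5):

...
pool4 = convolution_module(net, kernel_size, pad_size, filter_count=filter_count * 4, down_pool=True)
net = pool4
pool5 = convolution_module(net, kernel_size, pad_size, filter_count=filter_count * 8, down_pool=True)
net = pool5
net = mx.sym.Dropout(net)  # moved here, after pool5, to match the blog post
net = convolution_module(net, kernel_size, pad_size, filter_count=filter_count * 4, up_pool=True)
...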
