
2016-ml-contest

THE CONTEST IS NOW CLOSED. THANK YOU TO EVERYONE WHO PARTICIPATED.

Final standings: congratulations to LA_Team!

The top teams, based on the median F1-micro score from 100 realizations of their models, were:

| Position | Team | F1 | Algorithm | Language | Solution |
|----------|------|------|---------------|----------|----------|
| 1 | LA_Team (Mosser, de la Fuente) | 0.6388 | Boosted trees | Python | Notebook |
| 2 | PA Team (PetroAnalytix) | 0.6250 | Boosted trees | Python | Notebook |
| 3 | ispl (Bestagini, Tuparo, Lipari) | 0.6231 | Boosted trees | Python | Notebook |
| 4 | esaTeam (Earth Analytics) | 0.6225 | Boosted trees | Python | Notebook |

I have stochastic scores for other teams, and will continue to work through them, but it seems unlikely that these top teams will change at this point.


Welcome to the Geophysical Tutorial Machine Learning Contest 2016! Read all about the contest in the October 2016 issue of the magazine. Look for Brendon Hall's tutorial on lithology prediction with machine learning.

You can run the notebooks in this repo in the cloud; just click the badge below:

Binder

You can also clone or download this repo with the green button above, or just read the documents:

Leaderboard

F1 scores of models against secret blind data in the STUART and CRAWFORD wells. The logs for those wells are available in the repo, but contestants do not have access to the facies.

**These are deterministic scores; the final standings depend on stochastic scores (see above).**

| Team | F1 | Algorithm | Language | Solution |
|------|------|---------------|----------|----------|
| LA_Team (Mosser, de la Fuente) | 0.641 | Boosted trees | Python | Notebook |
| ispl (Bestagini, Tuparo, Lipari) | 0.640 | Boosted trees | Python | Notebook |
| SHandPR | 0.631 | Boosted trees | Python | Notebook |
| HouMath | 0.630 | Boosted trees | Python | Notebook |
| esaTeam | 0.629 | Boosted trees | Python | Notebook |
| Pet_Stromatolite | 0.625 | Boosted trees | Python | Notebook |
| PA Team | 0.623 | Boosted trees | Python | Notebook |
| CC_ml | 0.619 | Boosted trees | Python | Notebook |
| geoLEARN | 0.613 | Random forest | Python | Notebook |
| ar4 | 0.606 | Random forest | Python | Notebook |
| Houston_J | 0.600 | Boosted trees | Python | Notebook |
| Bird Team | 0.598 | Random forest | Python | Notebook |
| gccrowther | 0.589 | Random forest | Python | Notebook |
| thanish | 0.580 | Random forest | R | Code |
| MandMs | 0.579 | Majority voting | Python | Notebook |
| evgenizer | 0.578 | Boosted trees | Python | Notebook |
| jpoirier | 0.574 | Random forest | Python | Notebook |
| kr1m | 0.570 | AdaBoosted trees | Python | Notebook |
| ShiangYong | 0.570 | ConvNet | Python | Notebook |
| CarlosFuerte | 0.570 | Multilayer perceptron | Python | Notebook |
| fvf1361 | 0.568 | Majority voting | Python | Notebook |
| CarthyCraft | 0.566 | Boosted trees | Python | Notebook |
| gganssle | 0.561 | Deep neural net | Lua | Notebook |
| StoDIG | 0.561 | ConvNet | Python | Notebook |
| wouterk1MSS | 0.559 | Random forest | Python | Notebook |
| Anjum48 | 0.559 | Majority voting | Python | Notebook |
| itwm | 0.557 | ConvNet | Python | Notebook |
| JJlowe | 0.556 | Deep neural network | Python | Notebook |
| adatum | 0.552 | Majority voting | R | Notebook |
| CEsprey | 0.550 | Majority voting | Python | Notebook |
| osorensen | 0.549 | Boosted trees | R | Notebook |
| rkappius | 0.534 | Neural network | Python | Notebook |
| JesperDramsch | 0.530 | Random forest | Python | Notebook |
| cako | 0.522 | Multilayer perceptron | Python | Notebook |
| BGC_Team | 0.519 | Deep neural network | Python | Notebook |
| CannedGeo | 0.512 | Support vector machine | Python | Notebook |
| ARANZGeo | 0.511 | Deep neural network | Python | Code |
| daghra | 0.506 | k-nearest neighbours | Python | Notebook |
| BrendonHall | 0.427 | Support vector machine | Python | Initial score in article |

Getting started with Python

Please refer to the User guide to the geophysical tutorials for tips on getting started in Python and find out more about Jupyter notebooks.

Find out more about the contest

If you intend to enter this contest, I suggest you check the open issues and read through the closed issues too. There's some good info in there.

To find out more please read the article in the October issue or read the manuscript in the tutorials-2016 repo.

Rules

We've never done anything like this before, so there's a good chance these rules will become clearer as we go. We aim to be fair at all times, and reserve the right to make judgment calls for dealing with unforeseen circumstances.

IMPORTANT: When this contest was first published, we asked you to hold the SHANKLE well blind. This is no longer necessary. You can use all the published wells in your training. Related: I am removing the file of predicted facies for the STUART and CRAWFORD wells, to reduce confusion — they are not actual facies, only those predicted by Brendon's first model.

  • You must submit your result as code and we must be able to run your code.
  • Entries will be scored by a comparison against known facies in the STUART and CRAWFORD wells, which do not have labels in the contest dataset. We will use the F1 cross-validation score. See issue #2 regarding this point. The scores in the 'leaderboard' reflect this.
  • Where there is stochastic variance in the predictions, the median of 100 realizations will be used as the cross-validation score. See issue #114 regarding this point. The scores in the leaderboard do not currently reflect this. Probably only the top entries will be scored in this way. [updated 23 Jan]
  • The result we get with your code is the one that counts as your result.
  • To make it more likely that we can run it, your code must be written in Python or R or Julia or Lua [updated 26 Oct].
  • The contest is over at 23:59:59 UT (i.e. midnight in London, UK) on 31 January 2017. Pull requests made after that time won't be eligible for the contest.
  • If you can do even better with code you don't wish to share fully, that's really cool, nice work! But you can't enter it for the contest. We invite you to share your result through your blog or other channels... maybe a paper in The Leading Edge.
  • This document and documents it links to will be the channel for communication of the leading solution and everything else about the contest.
  • This document contains the rules. Our decision is final. No purchase necessary. Please exploit artificial intelligence responsibly.

Licenses

Please note that the dataset is not openly licensed. We are working on this, but for now please treat it as proprietary. It is shared here exclusively for use on this problem, in this contest. We hope to have news about this in early 2017, if not before.

All code is the property of its author and subject to the terms of their choosing. If in doubt — ask them.

The information about the contest, and the original article, and everything in this repo published under the auspices of SEG, is licensed CC-BY and OK to use with attribution.

2016-ml-contest's People

Contributors

adatum, admccarthy, alexcombessie, alfo5123, anjum48, antoine-cate, bestagini, cako, cannedgeo, dagrha, dalide, davidgtang, esa-as, fvf1361, gccrowther, houstonj2013, joshuaadampoirier, justingosses, kdarnell, kwinkunks, lperozzi, lukasmosser, mtaufanr, mycarta, oclipa, priyankaraghavan, shiangyong, thanish, whatisai, wouterk1mss


2016-ml-contest's Issues

Publication vs. Contest

If something is submitted to the contest, can it still be published in an article in The Leading Edge?

This may be a bit too meta but:

Making submissions public on GitHub would make it possible for anyone to take your approach, tweak it slightly, and then publish it as their own?

How do other contests such as Kaggle handle this?

Where to find the closed issues

Hi @kwinkunks, first of all, sorry for opening this issue, which is more about learning GitHub than about the contest. I'm logging back into this contest after a few weeks and I've lost track of where I left off. I don't want to redo the analysis or re-ask the questions I raised before, but they are all closed now and I am unable to see the closed issues. If I could view them, along with the issues raised by other members that are now closed, I could get started easily.

Using validation data

It is usually a bad idea for the validation data to affect the trained model, such as including the validation data when computing the mean for standardization, as this can lead to overly optimistic validation scores.

If there are already lots of boreholes drilled with log data but no core facies classification, and the goal is simply to classify these wells, I could imagine potentially including all of the log data when standardizing the data. If the goal is instead to make something which can be applied to the existing wells, and also any future boreholes that are drilled, then I think it would be unwise.

Should we make this against the rules of the contest, or is it permissible in this case?

Predefined "blind" well data as measure of performance leads to over-fitting.

The contest outlines that the same well as in the publication will be used to judge the performance of the proposed solutions. This can lead to overfitting, by using the prediction capability on the proposed well as a loss function.

Should another performance measure be used to compensate for overfitting?
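One common alternative (a sketch only, not part of the official rules; the data and well labels here are made up) is leave-one-well-out cross-validation, so that no single fixed well acts as the loss function:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic stand-ins for log features, facies labels, and well names.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))             # stand-in log features
y = rng.integers(1, 4, size=60)          # stand-in facies labels (1..3)
wells = np.repeat(["A", "B", "C"], 20)   # hypothetical well labels

# Each fold holds out every sample from one well, mimicking a truly blind well.
logo = LeaveOneGroupOut()
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, groups=wells, cv=logo, scoring="f1_micro")
print(scores)  # one F1-micro score per held-out well
```

Averaging these per-well scores gives a performance estimate that is less tied to any one well.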

Blind set inconsistent in article and Jupyter notebook

Can someone help me understand what the blind dataset is going to be?

In index.ipynb it is said that the blind dataset will be the same one Brendon used.

When I checked the Jupyter notebook, I found that the blind data is the NEWBY well.

But in the magazine published in October 2016, I found that SHANKLE was used as the blind dataset.

Just notice something

Hi @kwinkunks, I know it has been one year already. I just happened to take a look at this repo again and found, in utils.py, that the score used is "accuracy", not the actual "F1 score". Is that right?

Is there a leaderboard

Hi, is there going to be a leaderboard where we can see the scores of our model and those of the other contributors?

The magic number 0.43

Which testing dataset was the prediction done on to reach an accuracy of 0.43? I mean, was the prediction done on the blind dataset with the SHANKLE well, the NEWBY well, or some other data?

How can we compare the true labels to the predicted labels?

Hi,
It is a really cool competition. Unfortunately, we could not join it; I wish we could have. But I have checked most of the repos and realised that the true labels, in the file 'blind_stuart_crawford_core_facies.csv', have 890 rows, while the predicted submissions of the teams have only 830 rows. Is this a mistake, or an update to the repo? How can we compute our test accuracy ourselves? Thank you for your time,
Vural

Help with how to make submission

Hi guys, can someone tell me how to submit results to this contest? I'm new to GitHub and everything seems out of my way. I'm not sure if my uploaded file was notified to the admin after I made the pull request. In fact, I'm not sure the pull request I made was correct, and I'm not able to see my notebook the way it appears for the others; it looks like HTML code to me. If you could post a short video covering everything from forking or cloning, to creating a separate folder for the team, to uploading and submitting the results (pull request), it would be very helpful for beginners like me. I hope I'm not asking too much.

Thanks
Thanish

Thanks

Thanks for organizing this ML contest. It was fun and a very useful learning experience. I blame the data for my poor result :) Looking forward to the ML event at EAGE Paris this year.

Is there interest in a geo-ML LinkedIn group for connecting and further discussion? Or does one already exist?

Also, are there any plans for a follow-up? I thought being able to see other people's notebooks was helpful on the one hand, but it also drags down the plurality of methods. For a next contest I would suggest keeping the top 5 hidden, or making sharing optional until the results are out.

Can't figure out how to submit my notebook

New to GitHub, and I can't figure out how to submit my notebook despite looking at "Help with how to make submission #27".

How do I make my own folder in seg/2016-ml-contest: 'create new file'?
Then I have to make the pull request?

I have the notebook uploaded onto my Git page...

Any help?
Thanks

Training_data.csv?

First, thanks for putting this exercise together, and please excuse me stumbling through this as I learn pretty much from scratch. Just to clarify: the CSV file used in the paper is 'training_data', but the tutorial "facies_classification" notebook uses 'facies_vectors', which is then renamed to 'training_data', right? The absence of PE data for a couple of wells in the tutorial that are not in the paper's training_data.csv is the main reason I ask.

Thanks!
Bryan

One question

First of all, thanks for organizing this contest; it was a great experience. I have been busy with work for the last few weeks and haven't noticed the changes. The one thing I don't understand is that PA Team jumps to the top in the stochastic scoring. Is it because you didn't just take the highest score from randomly selected seeds?

Facies Formations and NM_M

I have been experimenting to see whether a classifier would work well using the marine and non-marine formations and facies separately. This could be a terrible idea, but it's also been fun for me while learning to code. Anyway! It looks like there may be some discrepancies in the data, where a non-marine facies (1, 2, 3) shows a marine classifier (NM_M = 2), and vice versa: marine facies show a non-marine classifier. There are not a lot of them (about 50 in total), but they do exist, and they are not present in all the wells; some wells have more than others (e.g. Cross H). Has anyone else noticed this? Or maybe I'm just doing it wrong? I made a notebook for this here. I also wonder how much it would actually affect predictions, since it's a small number, and whether the validation data contains the same discrepancies. Any thoughts? Thanks! @kwinkunks

You can also try this code to check:

```python
# Non-marine facies (1, 2, 3) that carry the marine indicator (NM_M == 2)
td_NM = training_data.loc[training_data['Facies'].isin([1, 2, 3])]
td_NM.loc[td_NM['NM_M'].isin([2])]
```

Definition of scoring method unclear

Which score will be used to score an individual prediction?

  • accuracy(confusion_matrix)
  • accuracy_adjacent(confusion_matrix, adjacent_facies)
  • sklearn.metrics.f1_score(y_true, y_pred, average="weighted")
  • accuracy * accuracy_adjacent?

Clarification on 3 files of same content

I can see 3 files:

  1. well_data_with_facies.csv
  2. validation_data_nofacies.csv
  3. nofacies_data.csv

These have the same data, with info about the two wells STUART and CRAWFORD and the same number of rows; the only difference is that the Facies column is present in well_data_with_facies.csv. Please clarify the purpose of these 3 datasets. I am guessing this is the test dataset on which the prediction should be done, if the test set is not the blind data. Correct me if I am wrong.
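A quick way to check this kind of thing yourself is to load the files with pandas and compare shapes and columns. A minimal sketch (toy stand-in DataFrames here so it runs anywhere; in the repo you would use `pd.read_csv` on the three filenames above):

```python
import pandas as pd

def compare_frames(frames):
    """Report each DataFrame's shape and the columns shared by all of them."""
    shapes = {name: df.shape for name, df in frames.items()}
    common = set.intersection(*(set(df.columns) for df in frames.values()))
    return shapes, sorted(common)

# Toy stand-ins for the repo's CSV files (hypothetical columns and values).
frames = {
    "with_facies": pd.DataFrame({"Depth": [1, 2], "GR": [10.0, 20.0], "Facies": [3, 5]}),
    "nofacies": pd.DataFrame({"Depth": [1, 2], "GR": [10.0, 20.0]}),
}
shapes, common = compare_frames(frames)
print(shapes, common)
```

If the shared columns and row counts match, the files carry the same data apart from the extra Facies column.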

training dataset

Um, maybe a dumb question, but I see some people using "facies_vectors.csv" and others "training_data.csv". What is the training set, or is it all up to us?

PE regression?

Hi all, has anyone tried generating a PE curve for the wells that do not have it (Alexander D and Kimzey A)? I have made some attempts using `from sklearn.svm import SVR` and playing around with the different models. Do you think this would lead to a valid answer? Are there better regression techniques to use?

I started a repo for PE regression that includes a notebook I have been playing with. To call it sloppy is probably an understatement, but it's there as a work in progress.
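The SVR approach can be sketched like this (synthetic stand-in data, not the contest wells; the feature names in the comment are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: predict PE from the other log curves for samples
# where PE was not recorded (hypothetical data and coefficients).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))   # e.g. GR, ILD_log10, DeltaPHI, PHIND stand-ins
pe = 3.0 + X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.1, size=200)

has_pe = np.arange(200) < 150   # pretend the last 50 samples lack PE

# Scaling matters for SVR, so wrap it in a pipeline with StandardScaler.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X[has_pe], pe[has_pe])
pe_filled = model.predict(X[~has_pe])   # imputed PE for the "missing" samples
```

Whether this beats simpler imputation (e.g. per-formation means) is an empirical question worth cross-validating.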

Are all submission scores visible in the leader-board?

Hi @kwinkunks

Enjoyable contest, thanks for the efforts you're putting in! Just wondering whether scores for all submissions are put into the leader board? Or only the top scores for each team? I haven't seen a score for my second submission and I wanted to ascertain how far off the mark it might be!!

Cheers,
George

Data description

Hi @kwinkunks, I think I saw somewhere an explanation of what each feature in the provided dataset means, but I am not sure whether I really read a feature description. In case I'm missing something, can you provide me the link, or, if it doesn't exist, add a data description to the repo? Thanks

Reconfirming the evaluation metric

I think it might have already been discussed in #4, but just to reconfirm: what is the evaluation metric for this contest? It's F1, which is 2 * (precision * recall) / (precision + recall), right? And not accuracy, which would be (sum of the diagonal of the confusion matrix) / (total number of test, or blind, samples)?
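You can compare the two metrics directly with scikit-learn (a sketch with made-up labels). Note that for a single-label multiclass problem, micro-averaged F1 reduces to plain accuracy, so the choice of averaging matters as much as the choice between "F1" and "accuracy":

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up facies labels for illustration.
y_true = [1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 1]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")      # equals accuracy here
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(acc, f1_micro, f1_weighted)
```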

Random seed and model stability

Hi everyone,

As everyone has seen, the random seed can have a significant effect on the prediction scores. This is because most of us are using algorithms with a random component (e.g., random forest, extra trees...). The effect is probably enhanced by the fact that the dataset we are working on is small and non-stationary.

Matt has been solving the problem by testing a series of random seeds and taking the best. This avoids discarding a model just because of a "bad" random seed. However, this might favor the most unstable models. A very stable model will yield scores in a small range when testing several random seeds, while an unstable model will yield a wide range of scores when testing several random seeds. Thus, it is likely that an unstable model can get a very high score given enough random seeds are tested. But it does not mean the model will be good at predicting new test data.

A possible solution would be to test 10 (or some other number of) random seeds and take the median score as the prediction score. It would require us to include that directly in our scripts to avoid further work for Matt. We could just make 10 predictions, using 10 random seeds, and export them in a single csv file.

What do you guys (and especially Matt) think about that?
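The proposal above can be sketched like this (`train_and_score` is a hypothetical stand-in for a full train-and-evaluate run):

```python
import numpy as np

def train_and_score(seed):
    """Stand-in for training a model with a given random seed and
    returning its F1 score; replace with a real pipeline."""
    rng = np.random.default_rng(seed)
    return 0.60 + rng.normal(scale=0.01)  # fake score with seed-dependent jitter

# Train with several seeds and report the median, which rewards
# stable models rather than models that got a lucky seed.
seeds = range(10)
scores = [train_and_score(s) for s in seeds]
median_score = float(np.median(scores))
print(round(median_score, 4))
```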

Not working code

Hello.
Why doesn't the code work for the same logging data?

```
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     12     return labels[row['Lith_Section']-1]
     13
---> 14 training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
     15 training_data.describe()

<ipython-input> in label_facies(row, labels)
     10
     11 def label_facies(row, labels):
---> 12     return labels[row['Lith_Section']-1]

TypeError: ('list indices must be integers, not float', u'occurred at index 0')
```

Thank you
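For what it's worth, the traceback suggests the `Lith_Section` column was loaded as floats, so a likely fix (a guess from the error message alone; the label names below are stand-ins borrowed from the contest facies) is to cast to `int` before indexing:

```python
import pandas as pd

# Stand-in label list (the contest facies abbreviations, for illustration).
facies_labels = ["SS", "CSiS", "FSiS", "SiSh", "MS", "WS", "D", "PS", "BS"]

def label_facies(row, labels):
    # Cast to int: pandas may load the column as float64, and
    # list indices must be integers.
    return labels[int(row["Lith_Section"]) - 1]

# Toy frame reproducing the float-typed column from the traceback.
training_data = pd.DataFrame({"Lith_Section": [1.0, 3.0, 2.0]})
training_data["FaciesLabels"] = training_data.apply(
    lambda row: label_facies(row, facies_labels), axis=1)
print(training_data["FaciesLabels"].tolist())  # → ['SS', 'FSiS', 'CSiS']
```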

Number of submissions

Hi @kwinkunks, considering it's not the traditional machine learning contest format, where you upload a .csv file and it automatically returns the accuracy: can we submit 2 or more models (per user) between the times you score our previous model? Until now I have been building one model and waiting for your score before submitting the next. It feels like one submission per day, with a lot of idle time in between. Let me know your thoughts.

Any restriction on the number of features ?

Is it mandatory that only the five wireline logs and the two geologic constraining features be used for modelling? I would like to explore more and work on feature engineering. Please confirm.
