Comments (2)
CodeSearchNet has two tasks:
a) Given a documentation comment (e.g. a Python docstring), try to find the original code snippet that matches that comment. For this task, there is plenty of data, so supervised machine learning methods can be used to train models.
- During training, we randomize the batch elements and ask the model to learn to pick the correct code from within the batch.
- During validation/testing, the batch elements are fixed (given that the evaluation dataset doesn't change) and therefore comparing among models is possible.
- We train our models with the MRR objective, using a fixed validation ordering and batch size.
b) However, documentation comments are not necessarily representative of real code search queries. For this reason, we have collected a small dataset of human relevance annotations (which is hidden behind the leaderboard submission).
- Relevance annotations are commonly evaluated using NDCG. MRR does not apply.
- The ability to rank the correct snippet highly does not necessarily correlate with NDCG on the human relevance annotations, since documentation comments differ from search queries.
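For readers unfamiliar with NDCG, here is a minimal sketch of how it scores one query's ranked results against graded relevance labels. The function names and the 0 to 3 relevance scale in the example are assumptions for illustration, and this uses the linear-gain variant; the exact variant used by the leaderboard is not stated in this thread.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain: rel / log2(position + 1)."""
    # enumerate() is 0-based, so position p (1-based) corresponds to pos + 1.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the model's ordering divided by DCG of the ideal ordering.

    ranked_relevances: human relevance labels of the returned snippets,
    listed in the order the model ranked them (hypothetical input format).
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A model that puts the most relevant snippet last is penalized.
perfect = ndcg([3, 2, 1, 0])
worst = ndcg([0, 1, 2, 3])
```

Unlike MRR, this rewards placing every relevant snippet high, not just a single "correct" one, which is why it suits graded human annotations.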
How should you train the model? That's up to you. We suggest using MRR and the docstring<->code task, but feel free to pick any alternative you think would work best. How do you pick the best model? Again, it's up to you.
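As a rough illustration of the in-batch MRR described above, the sketch below assumes a square score matrix in which `scores[i][j]` is the model's similarity between query `i` and code snippet `j`, with the matching snippet on the diagonal. The function name and the tie-handling rule are assumptions, not the repository's actual implementation.

```python
def in_batch_mrr(scores):
    """Mean reciprocal rank over a batch.

    scores[i][j]: similarity of docstring i to code snippet j.
    The correct snippet for docstring i is assumed to be j == i;
    all other snippets in the batch act as distractors.
    """
    reciprocal_ranks = []
    for i, row in enumerate(scores):
        # Rank = 1 + number of distractors scoring strictly higher than the match.
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

scores = [
    [0.9, 0.1, 0.3],  # correct snippet ranked 1st
    [0.8, 0.5, 0.2],  # correct snippet ranked 2nd
    [0.7, 0.6, 0.1],  # correct snippet ranked 3rd
]
mrr = in_batch_mrr(scores)  # (1 + 1/2 + 1/3) / 3
```

Because the rank depends on which distractors share the batch, the value is only comparable across models when the batch contents and batch size are held fixed, as noted for validation/testing.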
Have a look at the Technical Report for more info.
Hi @mallamanis
Thank you very much. The explanation is very clear.