2023 Kaggle AI Report

Related repo for https://www.kaggle.com/competitions/2023-kaggle-ai-report

Project Goal

The primary aim of this project is to leverage the diverse expertise within the Kaggle community to centralize and summarize the rapid advancements in AI over the past two years. This project is designed to share the collective perspectives of the Kaggle community with the broader research community.

Project Context

In this analytical competition, participants will write an essay on one of the following seven topics, elucidating what the community has learned over the past 2 years of working and experimenting with:

  • Text data
  • Image and/or video data
  • Tabular and/or time series data
  • Kaggle Competitions
  • Generative AI
  • AI ethics
  • Other (anything that does not fall into any other category)

Kaggle has provided a collection of writeups and a collection of scholarly articles in the official competition dataset, but participants are free to use any resources. Participants will also engage in a peer feedback process that involves reviewing three peer essays. Submissions will be evaluated by an expert grading panel of seven Kaggle Grandmasters. Winning essays will be aggregated in a post-competition publication and credited to their respective authors.

My Idea

Project Goal

The primary goal of this project is to craft a research paper articulating the relationship between community problem-solving approaches and academic paper publications. This includes exploring questions such as: How does the community utilize academic papers to solve specific problems? Does the community prefer using techniques from newer papers or older ones? Are there any discernible trends in this respect?

Moreover, instead of writing the paper by hand, I plan to use generative models to help draft it, possibly pre-trained models or LangChain's document-based tooling.

Data Analysis Approach

The dataset for this project consists of Kaggle competition write-ups and metadata from arXiv papers. The arXiv dataset includes vital information such as the title, abstract, category, and release date of each paper. However, arXiv uses its own categorization system, which might not align with the competition's seven topic categories, so the first task is to re-categorize the papers into one of these seven topics and keep only the data relevant to the chosen theme.
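As a rough illustration of the re-categorization step, a coarse first pass could map a few arXiv category codes directly to competition topics and leave everything else for a later classifier. This is only a sketch: the category codes are real arXiv codes, but the mapping itself, the function name `coarse_topic`, and the assumption that categories arrive as a space-separated string are mine, not part of the competition dataset's documentation.

```python
from typing import Optional

# Hypothetical mapping from arXiv category codes to competition topics.
# The codes exist on arXiv; assigning them to these topics is an assumption.
ARXIV_TO_TOPIC = {
    "cs.CL": "Text data",                 # Computation and Language
    "cs.CV": "Image and/or video data",   # Computer Vision and Pattern Recognition
    "cs.CY": "AI ethics",                 # Computers and Society
}


def coarse_topic(categories: str) -> Optional[str]:
    """Return the topic of the first mapped category code, or None.

    `categories` is assumed to be a space-separated string of arXiv codes,
    e.g. "cs.CL cs.LG". Papers that return None still need a real classifier.
    """
    for code in categories.split():
        topic = ARXIV_TO_TOPIC.get(code)
        if topic is not None:
            return topic
    return None


print(coarse_topic("cs.CL cs.LG"))
print(coarse_topic("math.PR"))
```

A pass like this only reduces the number of abstracts that need model-based classification; broad codes such as cs.LG do not point to a single topic and are deliberately left unmapped.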

The Kaggle competition write-ups also need to be categorized according to the competition's classification. Their dates matter as well, because each write-up has to be correlated with the release dates of the papers.
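One way to line up the two timelines, sketched here under my own assumptions, is to sort the papers by release date and, for each write-up, select the papers released within a fixed window before it. The function name `papers_in_window` and the two-year default window are hypothetical choices, not something the dataset prescribes.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def papers_in_window(paper_dates, writeup_date, window_days=730):
    """Return (lo, hi) so that paper_dates[lo:hi] are the papers released
    in the `window_days` before `writeup_date`, both ends inclusive.

    `paper_dates` must be sorted in ascending order.
    """
    start = writeup_date - timedelta(days=window_days)
    lo = bisect_left(paper_dates, start)          # first paper on or after start
    hi = bisect_right(paper_dates, writeup_date)  # one past the last paper on or before the write-up
    return lo, hi


# Made-up dates, for illustration only.
paper_dates = [date(2020, 1, 1), date(2021, 6, 1), date(2022, 3, 1), date(2023, 1, 1)]
lo, hi = papers_in_window(paper_dates, date(2022, 6, 1))
print(paper_dates[lo:hi])
```

Comparing the age distribution of the papers inside each window with the papers a write-up actually cites is one possible way to approach the "newer versus older papers" question.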

Since the write-ups dataset has already been filtered down to the best and near-best solutions from past competitions, the focus should be on establishing the connection between these solutions and published papers. The goal is to construct this connection and present it as the thesis and conclusion of the paper.

Current Problem

At present, I am working on categorizing arXiv papers into competition categories. I've envisioned two strategies for this step:

  1. Use an LLM API for zero-shot classification of the paper abstracts. This approach might require a considerable budget.

  2. As in the first approach, call the LLM on paper abstracts, but only within its free credit, to generate a small labeled dataset. Given the current token consumption rate, I estimate I can generate around 1,000 labeled instances. An encoder model such as BERT would then be fine-tuned on this small dataset, and the fine-tuned BERT, together with its classification head, would classify the remaining papers. However, I have concerns about this approach: the total number of papers to classify is approximately 230,000, and I am unsure how effective fine-tuning on such a small sample will be.
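The budget question behind both strategies comes down to simple arithmetic, sketched below. Only the 230,000 paper count comes from the text above; the tokens-per-abstract figure and the price per 1,000 tokens are placeholders I made up for illustration and should be replaced with measured token counts and the provider's actual pricing.

```python
def estimate_cost_usd(n_papers, tokens_per_paper, usd_per_1k_tokens):
    """Estimate the API cost of classifying n_papers abstracts.

    tokens_per_paper should cover prompt plus completion tokens for one call.
    """
    total_tokens = n_papers * tokens_per_paper
    return total_tokens / 1000 * usd_per_1k_tokens


def labels_within_credit(credit_usd, tokens_per_paper, usd_per_1k_tokens):
    """Number of abstracts that can be labeled before the credit runs out."""
    cost_per_paper = tokens_per_paper / 1000 * usd_per_1k_tokens
    return int(credit_usd // cost_per_paper)


# Placeholder numbers, NOT real pricing: 300 tokens per call, 0.002 USD per 1k tokens.
full_run = estimate_cost_usd(230_000, tokens_per_paper=300, usd_per_1k_tokens=0.002)
print(f"Estimated cost for all papers: {full_run:.2f} USD")
print("Labels within a 5 USD credit:", labels_within_credit(5.0, 300, 0.002))
```

Running the same calculation with real numbers makes it easier to decide whether zero-shot classification of the full set is affordable or whether the smaller labeled set in strategy 2 is the only option.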

I have also tried using other Transformer-based models like BART for Zero-Shot classification. While this method is entirely free, the results have been unsatisfactory.
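For reference, a free zero-shot baseline of this kind can be set up with the Hugging Face `transformers` zero-shot pipeline. The sketch below is not necessarily the setup I used: the `facebook/bart-large-mnli` checkpoint, the helper names, and the idea of falling back to "Other" below a score threshold are all assumptions made for illustration.

```python
# Six concrete topics from the competition; "Other" is used as a fallback
# rather than as a candidate label (a design assumption of this sketch).
TOPICS = [
    "Text data",
    "Image and/or video data",
    "Tabular and/or time series data",
    "Kaggle Competitions",
    "Generative AI",
    "AI ethics",
]


def build_bart_classifier():
    """Create a zero-shot classifier; downloads the model on first use."""
    # Imported lazily so top_topic() can be tried without transformers installed.
    from transformers import pipeline
    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def top_topic(abstract, classifier, labels=TOPICS, min_score=0.0):
    """Return (topic, score) for one abstract.

    The pipeline returns labels sorted by descending score, so index 0 is the
    best match. Scores below min_score are mapped to "Other".
    """
    result = classifier(abstract, candidate_labels=labels)
    label, score = result["labels"][0], result["scores"][0]
    return (label if score >= min_score else "Other", score)


# Usage (requires transformers and a model download):
# clf = build_bart_classifier()
# print(top_topic("We propose a new tokenizer for multilingual text.", clf))
```

Checking the predictions against a small hand-labeled sample would show whether the unsatisfactory results come from the label wording, the threshold, or the model itself.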
