
zac2022-e2e-qa's Introduction

Solution for Zalo AI Challenge 2022 - E2E Question Answering

Overview [WIP]

The pipeline consists of five main steps:

  1. Split the Wikipedia dump into sliding windows of size 256.
  2. Retrieve candidate contexts with BM25 (Recall@200 ~ 0.95).
  3. Rerank the top 200 candidate contexts with a BERT sentence-pair model.
  4. Extract candidate answers from the contexts and choose the final result by majority vote + community detection w/ Louvain.
  5. Retrieve the top 100 candidate articles for the answer with BM25, then rerank them with a separate BERT sentence-pair model to pick the final article.
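As a rough illustration of the first step, the sketch below cuts a text into overlapping windows. Only the window size of 256 comes from the description above; the function name `sliding_windows`, the whitespace tokenization, and the stride of 128 are assumptions made for this example and may differ from what the repository actually does.

```python
from typing import List


def sliding_windows(text: str, size: int = 256, stride: int = 128) -> List[str]:
    """Split `text` into overlapping windows of at most `size` words.

    Hypothetical helper: whitespace tokenization and stride are assumptions.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    words = text.split()
    if not words:
        return []
    windows = []
    for start in range(0, len(words), stride):
        windows.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return windows
```

An overlapping stride keeps an answer that straddles a window boundary fully visible in at least one window, at the cost of indexing more text.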

Requirements

transformers==4.24.0
git+https://github.com/witiko/gensim.git@feature/bm25

Inference example

Download the pretrained models and the required data from: link, then extract them into the ./data/ directory.

See the example notebook.

question = "Công ty mẹ của Zalo là gì"  # "What is Zalo's parent company?"

Get the top 200 contexts with BM25

query = preprocess(question).lower()
top_n, bm25_scores = bm25_model_stage1.get_topk_stage1(query, topk=200)
titles = [preprocess(df_wiki_windows.title.values[i]) for i in top_n]
texts = [preprocess(df_wiki_windows.text.values[i]) for i in top_n]
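The internals of `get_topk_stage1` are not shown here; the repository relies on a gensim BM25 branch listed in the requirements. To make the retrieval step concrete, here is a minimal, self-contained sketch of textbook Okapi BM25 scoring. The function name `bm25_topk`, the whitespace tokenization, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions for illustration, not the project's settings.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_topk(query: str, docs: List[str], topk: int = 200,
              k1: float = 1.5, b: float = 0.75) -> Tuple[List[int], List[float]]:
    """Return indices and scores of the `topk` best documents for `query`.

    Hypothetical stand-in for the repository's BM25 retriever.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / max(n_docs, 1)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = query.lower().split()

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:topk]
    return order, [scores[i] for i in order]
```

Like the call above, it returns the candidate indices together with their scores, so the scores can be reused in later ranking stages.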

Rerank with the BERT sentence-pair model

question = preprocess(question)
ranking_preds = pairwise_model_stage1.stage1_ranking(question, texts)
ranking_scores = ranking_preds * bm25_scores
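The last line multiplies the reranker output by the BM25 score element by element, which presumably requires both values to be NumPy arrays (multiplying two plain Python lists raises a `TypeError`). The dependency-free sketch below shows the same fusion and a top-k selection on plain sequences; the helper names `fuse_scores` and `top_k_indices` are hypothetical and not part of the repository.

```python
from typing import List, Sequence


def fuse_scores(ranking_preds: Sequence[float],
                bm25_scores: Sequence[float]) -> List[float]:
    """Element-wise product of reranker scores and BM25 scores."""
    if len(ranking_preds) != len(bm25_scores):
        raise ValueError("both score lists must have the same length")
    return [p * s for p, s in zip(ranking_preds, bm25_scores)]


def top_k_indices(scores: Sequence[float], k: int = 10) -> List[int]:
    """Indices of the k largest scores, lowest of them first.

    Mirrors np.argsort(scores)[-k:], although ties may be ordered differently.
    """
    if k <= 0:
        return []
    return sorted(range(len(scores)), key=lambda i: scores[i])[-k:]


fused = fuse_scores([0.9, 0.1, 0.5], [10.0, 30.0, 20.0])
best = top_k_indices(fused, k=2)
```

Multiplying the two signals means a context needs both lexical overlap and a high reranker score to stay near the top; a weighted sum would be another reasonable design.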

Find the best answer with the QA model

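import numpy as np
# Keep the 10 highest-scoring contexts (argsort is ascending, so take the tail).
best_idxs = np.argsort(ranking_scores)[-10:]
ranking_scores = np.array(ranking_scores)[best_idxs]
texts = np.array(texts)[best_idxs]
# The QA model reads the kept contexts and returns the best answer.
best_answer = qa_model(question, texts, ranking_scores)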
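How `qa_model` aggregates the answers found in the individual contexts is not shown here. The overview mentions a majority vote combined with Louvain community detection; the sketch below covers only a simple score-weighted majority vote and leaves the community detection out. The function name `weighted_majority_vote` and the lowercase, whitespace-collapsing normalization are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Sequence


def weighted_majority_vote(answers: Sequence[str],
                           weights: Sequence[float]) -> str:
    """Return the answer whose normalized form has the largest total weight.

    Hypothetical helper: answers that differ only in case or spacing are
    counted as the same candidate.
    """
    totals: Dict[str, float] = defaultdict(float)
    surface: Dict[str, str] = {}
    for answer, weight in zip(answers, weights):
        key = " ".join(answer.lower().split())
        if not key:
            continue  # ignore empty predictions
        totals[key] += weight
        # Remember the first surface form seen for this candidate.
        surface.setdefault(key, answer)
    if not totals:
        return ""
    best_key = max(totals, key=totals.get)
    return surface[best_key]
```

Weighting each vote by the ranking score of its context lets a few highly ranked contexts outweigh many weak ones.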

Entity mapping to find the final answer

bm25_answer = preprocess(str(best_answer).lower(), max_length=128, remove_puncts=True)
bm25_question = preprocess(str(question).lower(), max_length=128, remove_puncts=True)
candidates, scores = bm25_model_stage2_title.get_topk_stage2(bm25_answer, raw_answer=best_answer)
titles = [df_wiki.title.values[i] for i in candidates]
texts = [df_wiki.text.values[i] for i in candidates]
ranking_preds = pairwise_model_stage2.stage2_ranking(question, best_answer, titles, texts)
final_answer = titles[ranking_preds.argmax()]

zac2022-e2e-qa's People

Contributors

suicao
