Question Answering System with BERT

Introduction

In this notebook we build a Question Answering system with BERT on the SQuAD dataset.
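
As a rough illustration of the end goal (not the notebook's actual code), an extractive QA model can be queried through the Hugging Face transformers pipeline; the SQuAD-fine-tuned checkpoint name below is an assumption:

```python
# Minimal sketch of extractive question answering with a BERT model
# fine-tuned on SQuAD, via the Hugging Face `transformers` pipeline.
# The checkpoint name is an assumption, not necessarily the one used here.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = "The IPCC was established by the World Meteorological Organization and UNEP."
result = qa(question="What organization is the IPCC a part of?", context=context)
print(result)  # a dict with 'answer', 'score', 'start', 'end'
```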

BERT

BERT was released by Google in October 2018. BERT is a bidirectional Transformer (essentially encoder-only) trained with a Masked Language Modelling objective and a Next Sentence Prediction task, where the goal is to predict the missing tokens. So given A _ C _ E, predict B and D.

BERT makes use of the Transformer architecture (attention mechanism), which learns contextual relations between words in a text. BERT falls into the self-supervised category of models: it derives its training inputs and targets from the raw corpus, without human-labelled data. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary.

As opposed to directional models, which read the text input sequentially (left to right or right to left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

[Image: p1_bert_highlevel (high-level view of the Transformer encoder)]

The diagram above is a high-level Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors, in which each vector corresponds to an input token with the same index.

When training language models, there is the challenge of defining a prediction goal (self-supervision). To overcome this challenge, BERT uses two training strategies.

MASKED LANGUAGE MODEL

Before feeding word sequences into BERT, 15% of the tokens in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence. In technical terms, predicting the output words requires the following steps (a small code sketch follows the list):

  1. Adding a classification layer on top of the encoder output
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  3. Calculating the probability of each word in the vocabulary with softmax.
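
A minimal sketch of this prediction step, using a pretrained masked-LM head from Hugging Face transformers (illustrative only; during pre-training roughly 15% of tokens are masked at random):

```python
# Minimal sketch of masked-token prediction with a pretrained BERT MLM head.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, vocab_size)

# Softmax over the vocabulary at the masked position gives the predicted word.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)
print(tokenizer.decode([probs.argmax().item()]))    # expected: "paris"
```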

NEXT SENTENCE PREDICTION

In the BERT training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are pairs in which the second sentence is indeed the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model (a tokenizer sketch follows the list):

  1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings, but with a vocabulary of size 2.
  3. A positional embedding is added to each token to indicate its position in the sequence.
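
A minimal sketch of this input preparation with the BERT tokenizer from transformers (the example sentences are arbitrary):

```python
# Minimal sketch of sentence-pair input preparation for BERT.
# [CLS]/[SEP] tokens and the sentence A/B segment ids (token_type_ids) are
# produced by the tokenizer; position embeddings are added inside the model.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(enc["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens
```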

To predict whether the second sentence is indeed connected to the first, the following steps are performed (sketched in code after the list):

  1. The entire input sequence goes through the Transformer model.
  2. The output of the [CLS] token is transformed into a 2x1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
  3. The probability of IsNextSequence is calculated with softmax.
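
A minimal sketch of these steps using the pretrained NSP head from transformers (the [CLS] pooling and the 2-way classification layer live inside the model):

```python
# Minimal sketch of next-sentence prediction with a pretrained BERT NSP head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

enc = tokenizer("The man went to the store.", "He bought a gallon of milk.",
                return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # shape (1, 2): index 0 = IsNext, index 1 = NotNext

print(logits.softmax(dim=-1))      # a coherent pair should score high on IsNext
```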

[Image: p1_bert]

While training the BERT model, Masked LM and NSP are trained together, with the goal of minimizing the combined loss function of the two strategies.

The BERT loss function takes into consideration only the predictions at the masked positions and ignores the predictions for the non-masked words (which further reduces the supervisory signal per sequence). As a consequence, the model converges more slowly than directional models, a characteristic that is offset by its increased context-awareness.
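
For reference, a minimal sketch of the joint objective using the Hugging Face BertForPreTraining head (not the notebook's training loop); the returned loss is the sum of the masked-LM and next-sentence losses:

```python
# Minimal sketch of the joint pretraining objective: the returned loss is the
# masked-LM cross-entropy (computed only at masked positions, label -100 is
# ignored elsewhere) plus the next-sentence-prediction cross-entropy.
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

enc = tokenizer("The man went to the [MASK].", "He bought a gallon of milk.",
                return_tensors="pt")
labels = torch.full_like(enc["input_ids"], -100)        # ignore non-masked tokens
mask_pos = enc["input_ids"] == tokenizer.mask_token_id
labels[mask_pos] = tokenizer.convert_tokens_to_ids("store")

out = model(**enc, labels=labels, next_sentence_label=torch.tensor([0]))
print(out.loss)   # combined MLM + NSP loss
```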

Training Logs

[Image: p1_training_logs]

Training loss

[Image: p1_training_loss]

Sample Results
question       >> When was Dali defended by the Yuan?
predicted answer >> 1253

question       >> What molecules of the adaptive immune system only exist in jawed vertebrates?
predicted answer >> immunoglobulins and T cell receptors

question       >> What does the capabilities approach look at poverty as a form of?
predicted answer >> capability deprivation

question       >> How much can the SP alter income tax in Scotland?
predicted answer >> up to 3 pence in the pound

question       >> The French thought bringing what would uplift other regions?
predicted answer >> Christianity and French culture

question       >> What organization is the IPCC a part of?
predicted answer >> World Meteorological Organization

question       >> At what pressure is water heated in the Rankine cycle?
predicted answer >> high pressure

question       >> What limits the Rankine cycle's efficiency?
predicted answer >> the working fluid

question       >> In what year did Joseph Priestley recognize oxygen?
predicted answer >> 1774
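
Each predicted answer above is a span of the passage. A minimal sketch of how such spans are extracted from the start/end logits of a BERT QA head (the checkpoint name and example passage are assumptions, not the notebook's exact setup):

```python
# Minimal sketch of extractive span prediction: the QA head produces start and
# end logits over the input tokens; the answer is the span between the argmax
# start and argmax end positions.
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed checkpoint
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "In what year did Joseph Priestley recognize oxygen?"
context = "Joseph Priestley is generally credited with recognizing oxygen in 1774."
enc = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

start = out.start_logits.argmax()
end = out.end_logits.argmax()
answer_ids = enc["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_ids))   # expected: "1774"
```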

Refer to the complete solution 👉 here.
