Dialog System that performs One-Class Classification, Question Answering, Sentence Similarity, and Natural Language Generation tasks
The task was to create a bot that welcomes the user only if the user sends a greeting first. The bot should give the correct answers to Data Science interview questions if they exist in the database. Otherwise, it should generate an answer with NLG techniques. If the user wants to stop the conversation, the bot should recognize that intent and say 'See you soon! Bye!'
Development of the project consisted of 4 main parts:
- One-Class Classification: preprocessing text, gathering datasets, and building 2 separate models that classify the intent to greet and the intent to quit the conversation. The solution is based on the `OneClassSVM` model from sklearn. Model performance is evaluated with the F1 score.
- Question Answering: retrieving the intent of the question with the text summarization model `google/pegasus-xsum`, applying Sentence Transformers (`all-mpnet-base-v2`) and cosine similarity to compare questions with the given context, and retrieving information with a BERT model pre-trained on the SQuAD2 dataset (`deepset/bert-base-cased-squad2`) to answer the question.
- Natural Language Generation: using `microsoft/DialoGPT-medium` to generate text.
- Development of an end-user Dialog System that can hold a conversation using the models described above.
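
A minimal sketch of how one of the two intent models could be trained with `OneClassSVM`. The source does not say which text features were used, so the TF-IDF character n-grams, the hyperparameters, the `is_greeting` helper, and the tiny phrase list below are all assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Hypothetical training phrases; the project's 'Greetings' dataset is not shown here.
greetings = [
    "hello", "hi", "hi there", "hey", "hello there",
    "good morning", "good evening", "greetings",
]

# Assumption: character n-gram TF-IDF features feeding a one-class SVM.
# The model only ever sees positive examples (greetings) during training.
greet_model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
)
greet_model.fit(greetings)


def is_greeting(message: str) -> bool:
    """Return True when the model treats the message as an inlier."""
    # OneClassSVM.predict returns +1 for inliers and -1 for outliers.
    return bool(greet_model.predict([message.lower()])[0] == 1)
```

A second model trained the same way on farewell phrases would cover the quitting intent. With a labelled test set of greetings and non-greetings, `sklearn.metrics.f1_score` can then be used to score the predictions.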
- For the One-Class Classification task, 2 datasets were created: 'Greetings' and 'Goodbyes'. They consist of common expressions of greeting and farewell. Extending these datasets could improve the performance of the models.
- The QA part uses a dataset of Data Science interview questions. It has only 323 rows. To improve the information retrieval part of the project in the future, it would be worth finding a bigger one.
P.S. To make the transformer models more specific to this use case, it would be worth fine-tuning them on datasets related to computer science topics, for example scraped Quora/Stack Overflow questions and answers.
How does class ChatBot work?
- Wait for the user's input to start the conversation
- Classify whether the message expresses the intention to end the conversation with the `OneClassSVM` model: if yes, the chat ends.
- Classify whether the message is a greeting with the `OneClassSVM` model: if yes, randomly choose 1 of 4 sentences and say 'hello'.
- If the message is not a greeting, extract the main idea of the sentence with the `pegasus-xsum` summarization model. This removes redundant words, which make the search for similar sentences more difficult.
- Check whether the Data Science interview questions database has a similar question (by building sentence embeddings with the `Sentence Transformers` model and comparing questions with cosine similarity).
- If the user's input is a question from the database, call the `answer_with_BERT()` function and perform the information retrieval (with BERT pre-trained on the SQuAD2 dataset, `bert-base-cased-squad2`).
- If no similar question is found in the database, tokenize the input, save it in `history_ids`, and launch the `DialoGPT` model to generate the answer.
- Repeat these steps until the intention to quit is found.
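
The decision order in the steps above can be sketched as a single routing function. This is not the actual `ChatBot` class: the function name `reply`, its parameters, and the greeting texts are hypothetical, and each model is passed in as a plain callable so the flow can be read (and run) without loading any weights.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical greeting texts; the source only says there are 4 of them.
GREETING_REPLIES = ["Hello!", "Hi there!", "Hey!", "Nice to meet you!"]


def reply(
    message: str,
    is_goodbye: Callable[[str], bool],
    is_greeting: Callable[[str], bool],
    summarize: Callable[[str], str],
    find_similar_question: Callable[[str], Optional[str]],
    answer_from_database: Callable[[str], str],
    generate: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (bot reply, conversation_finished) for one user message."""
    # 1. The quitting intent is checked first and ends the chat.
    if is_goodbye(message):
        return "See you soon! Bye!", True
    # 2. A greeting is answered with one of the canned replies.
    if is_greeting(message):
        return random.choice(GREETING_REPLIES), False
    # 3. Reduce the message to its core idea before searching.
    core = summarize(message)
    # 4. Look for a similar question in the database.
    matched = find_similar_question(core)
    if matched is not None:
        # 5. Known question: answer from the stored context.
        return answer_from_database(matched), False
    # 6. Unknown question: fall back to free text generation.
    #    Assumption: the generator receives the original message.
    return generate(message), False
```

Calling `reply` in a loop until the second element of the result is `True` reproduces the "repeat until the intention to quit is found" step.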
`ConversationalAI.ipynb` (nbviewer) - Notebook version in which the Text Summarization Model wasn't used and the function for computing sentence embeddings wasn't optimized: it is less accurate and slower. It also contains research, such as trials of 3 ways of comparing sentence similarity: Levenshtein Distance, Jaccard Distance, and Cosine Similarity (which was chosen for the final implementation).
`ConversationalAI_v2.ipynb` (nbviewer) - Notebook version with the `pegasus-xsum` Text Summarization Model. It loads the pre-built sentence embeddings for the DS interview questions dataset (`sentence-transformers_embeddings_qa_data.pkl`) and the Classification Models (`model_greet.pkl`, `model_bye.pkl`) from the previous notebook.
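
For reference, the three similarity measures tried in the first notebook can be sketched in plain Python. These are textbook definitions, not the notebook's code. In particular, the cosine similarity here works on bag-of-words counts as an assumption to keep the example dependency-free, whereas the project computes it on Sentence Transformers embeddings.

```python
import math
from collections import Counter


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def jaccard_distance(a: str, b: str) -> float:
    """One minus the share of lowercase tokens the two sentences have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The first two measures compare surface forms (characters or tokens), so paraphrases with different wording score poorly. Cosine similarity on dense sentence embeddings compares meaning instead, which fits the choice reported above.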
The Text Summarization Model extracts the core idea of a message, which solves the problem of comparing differently worded sentences that have the same meaning:
What does linear regression stand for?
What is linear regression
Tell me pls, what is linear regression?
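
Once phrasings like these are reduced to the same core question, matching against the question database becomes a nearest-neighbour lookup over embeddings. The sketch below assumes the embeddings are already computed (in the project they would come from `all-mpnet-base-v2`). The function name `most_similar_question`, the 0.8 threshold, and the toy 3-dimensional vectors in the usage note are hypothetical.

```python
import numpy as np


def most_similar_question(query_embedding, question_embeddings, threshold=0.8):
    """Return (index, score) of the closest stored question.

    The index is None when the best cosine similarity is below the threshold,
    which signals that the question is not in the database.
    """
    query = np.asarray(query_embedding, dtype=float)
    matrix = np.asarray(question_embeddings, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix @ query
    best = int(np.argmax(scores))
    best_score = float(scores[best])
    return (best if best_score >= threshold else None), best_score
```

For example, with stored vectors `[[1, 0, 0], [0, 1, 0]]`, a query of `[0.9, 0.1, 0]` matches the first question, while a query of `[0, 0, 1]` matches nothing and would be routed to text generation.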