
CIS-700-7_TwitterChatbot

Homework 3

A chatbot trained on a Twitter dataset as part of Homework 3 for CIS 700-7. It has been integrated with the 'chat' skill on Alexa.

Evaluation of chatbot

We considered several aspects when evaluating the chatbot. First, we present a qualitative evaluation with good and bad example conversations.

Good sample conversations

User | Bot
--- | ---
"hey" | "good morning all"
"how are you." | "i ' m ok . thanks ."
"who is hillary clinton?" | "hillary is the one who cares ."
"you are sweet" | "i am a good friend "
"who is donald trump" | "he is an idiot"
"who do the like the best." | "you ' re the best i ' ve ever seen ."
"cookies" | "i need that !"
"thanks." | "no worries !"
"get lost" | "i ' m sorry ."
"no way." | "that is what i said ."
"bye" | "i ' m so fucking sad"

The replies of the bot seem relevant and coherent. Some responses might even convince a witness that a human was talking.

Bad sample conversations

User | Bot
--- | ---
"tell me something interesting" | "the original thing that was in the 00s"
"what do you think of global warming" | "something better than me . . ."
"what do you think of alexa." | "wow , i ' m so fucking gross ."
"mac or windows" | "just in the UNK"
"i am really hungry." | "i ' m not gonna try it ."
"let us go out for a movie." | "not the same thing ."

The bot's replies, while mostly grammatically correct, don't make sense, and the bot is unable to give satisfactory responses to the user. In one case the unknown token 'UNK' also appears in the reply.

Train loss

Cross entropy was used as the loss function during training, and 10% of the data was held out for cross validation. Training and cross-validation loss were recorded every 200 epochs. In addition, a set of 15 dialogs was created to test the model on the responses it generated; this check was run every 2 hours. Interestingly, the minimum cross-validation perplexity did not correspond to the best model in terms of the responses generated.

Train perplexity: 14.91

eval: bucket 0 perplexity 62.31

eval: bucket 1 perplexity 68.73

eval: bucket 2 perplexity 70.96

eval: bucket 3 perplexity 66.86
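
The loss and the perplexity figures above are two views of the same quantity. As a minimal sketch, assuming the reported perplexity is the exponential of the mean per-token cross entropy measured in nats (the usual convention, though the training script is not shown here), the reported values can be converted back to losses as follows. The function names are illustrative.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as the exponential of the mean per-token cross entropy (nats)."""
    return math.exp(mean_cross_entropy)


def cross_entropy(ppl: float) -> float:
    """Inverse of perplexity(): the mean per-token cross entropy in nats."""
    return math.log(ppl)


# Perplexities reported in this README.
reported = {
    "train": 14.91,
    "eval bucket 0": 62.31,
    "eval bucket 1": 68.73,
    "eval bucket 2": 70.96,
    "eval bucket 3": 66.86,
}

for name, ppl in reported.items():
    print(f"{name}: perplexity {ppl:.2f} -> cross entropy {cross_entropy(ppl):.3f} nats/token")
```

Under this assumption, the gap between the train and eval perplexities corresponds to a difference of roughly 1.4 to 1.6 nats per token in cross entropy.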

Possible metrics

Since the bot is free-form rather than task-based, it is harder to evaluate: task-completion metrics cannot simply be applied. A combination of metrics may be the best way to evaluate a general-purpose bot. According to Grice's maxims, effective communication in conversation can be achieved by:

  • Quality: speaker’s utterance is the truth as provable by adequate contextual evidence or domain facts
  • Quantity: speaker’s utterance provides as much information as appropriate, not more, not less
  • Relation: speaker’s utterance is relevant to the context and the topic of the conversation
  • Manner: speaker’s utterance is direct and straightforward

In light of these maxims, some candidate metrics could be:

  • Distance in sentence embeddings
  • Confusion metric: how often it returns ‘I don’t know’ or UNK or a similar response
  • Syntax tree of the generated response
  • Length of the response (to avoid very short and very long answers)
  • Overlap of topics in the entire user's dialog and the entire bot's response to ensure relevancy
  • Number of unique tokens to measure information gain
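
A few of the candidate metrics above are simple enough to sketch directly. The following is a rough illustration, not the evaluation code used for this homework: it assumes whitespace-tokenized, lowercased responses, and the list of "confused" phrases and all function names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical markers of a "confused" reply; adjust for a given bot.
# The spaced variant matches the pre-tokenized style of the bot output.
CONFUSED_PHRASES = ("i don't know", "i don ' t know", "i do not know")
UNKNOWN_TOKEN = "unk"


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization."""
    return text.lower().split()


def confusion_rate(responses: Sequence[str]) -> float:
    """Fraction of responses containing UNK or an 'I don't know'-style phrase."""
    if not responses:
        return 0.0

    def is_confused(response: str) -> bool:
        toks = tokenize(response)
        joined = " ".join(toks)
        return UNKNOWN_TOKEN in toks or any(p in joined for p in CONFUSED_PHRASES)

    return sum(is_confused(r) for r in responses) / len(responses)


def mean_length(responses: Sequence[str]) -> float:
    """Average number of tokens per response."""
    if not responses:
        return 0.0
    return sum(len(tokenize(r)) for r in responses) / len(responses)


def distinct_token_ratio(responses: Sequence[str]) -> float:
    """Unique tokens divided by total tokens across all responses."""
    all_tokens = [tok for r in responses for tok in tokenize(r)]
    return len(set(all_tokens)) / len(all_tokens) if all_tokens else 0.0


def token_overlap(user_text: str, bot_text: str) -> float:
    """Jaccard overlap of token sets, a crude stand-in for topic overlap."""
    user, bot = set(tokenize(user_text)), set(tokenize(bot_text))
    return len(user & bot) / len(user | bot) if user | bot else 0.0


sample = ["just in the UNK", "not the same thing .", "i ' m not gonna try it ."]
print(confusion_rate(sample), mean_length(sample), distinct_token_ratio(sample))
print(token_overlap("mac or windows", "just in the UNK"))
```

Token-set overlap only captures literal word reuse; the sentence-embedding distance listed above would be needed to detect replies that are on topic but use different words.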

Issues and possible solutions

Issue: The bot's response varies depending on the presence or absence of punctuation. Possible solution: Use TweetTokenizer to remove emoticons and punctuation (apart from '.', since AMAZON.LITERAL does not capture other punctuation).
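
To illustrate the kind of normalization this solution aims for, here is a minimal sketch that uses plain regular expressions as a stand-in for TweetTokenizer, so that it runs with the standard library alone. The emoticon pattern and the function name are assumptions and cover only a few common emoticons.

```python
import re

# A few common ASCII emoticons such as :) ;-) :D =P (illustrative, not exhaustive).
_EMOTICON = re.compile(r"(?<!\w)[:;=][-']?[)(\]\[DdPp/\\|](?!\w)")

# Keep word characters, whitespace, apostrophes and '.'; drop everything else,
# including other punctuation and emoji.
_DROP = re.compile(r"[^\w\s'.]")


def normalize(text: str) -> str:
    """Strip emoticons and punctuation other than '.', lowercase, collapse spaces."""
    text = _EMOTICON.sub(" ", text)
    text = _DROP.sub(" ", text.lower())
    return " ".join(text.split())


print(normalize("Mac or Windows?!"))
print(normalize("How are you. :)"))
```

Applying the same normalization to both the training data and the live user input would make inputs that differ only in such punctuation map to the same model input.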

Issue: Lowest perplexity doesn't guarantee the best model. Possible solution: Use a more appropriate loss function or a better metric to evaluate the model.

Issue: The chatbot doesn't remember previous statements and has no understanding of context. Possible solution: Incorporate memory for recollection and context of conversation.
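
One simple way to give the model some context, sketched here under the assumption that the model accepts a single input string, is to keep a sliding window of recent turns and prepend it to each new user utterance. The class name, the window size, and the separator are all hypothetical; this is not a learned memory mechanism.

```python
from collections import deque
from typing import Deque, Tuple


class ConversationMemory:
    """Keeps the most recent (user, bot) turns and prepends them to new input."""

    def __init__(self, max_turns: int = 3, separator: str = " . ") -> None:
        # deque(maxlen=...) silently discards the oldest turn once it is full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)
        self.separator = separator

    def add_turn(self, user: str, bot: str) -> None:
        """Record one completed exchange."""
        self.turns.append((user, bot))

    def build_input(self, user_utterance: str) -> str:
        """Join the remembered turns and the new utterance into one input string."""
        history = [text for turn in self.turns for text in turn]
        return self.separator.join(history + [user_utterance])


memory = ConversationMemory(max_turns=2)
memory.add_turn("hey", "good morning all")
memory.add_turn("cookies", "i need that")
print(memory.build_input("thanks"))
```

The model would need to be trained on inputs built the same way for the added history to help rather than confuse it.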

Future work

Future work consists of addressing the issues listed above. Alternative datasets can be considered for training to cover a wider range of topics. A reinforcement architecture can also be added to incorporate user feedback into the model, which may include sentiment/tone analysis. User-specific personalities can be created for the bot, each coming into play for a given person. The Twitter chatbot can also be compared with other bots such as ALICE and TickTock.

Contributors

vmansi, gjena
