
amineagrane / web-scraping-and-topics-modeling-android-appstore


This project performs topic modeling and sentiment analysis on user reviews of Android applications. The data is extracted from the Google Play Store using web scraping.

License: Apache License 2.0

Python 3.51% Jupyter Notebook 96.49%
android-application webscraping topic-modeling beautifulsoup4 selenium chromedriver latent-dirichlet-allocation sentiment-analysis gensim pyldavis


Introduction

Monitoring and responding to customer feedback is an essential first step, yet companies don't always take the time to analyze it. Indeed, who better than a person who has had a positive or negative experience with a product or service to give an opinion on it? Today, with the rise of the Internet, social networks and smartphones, user opinions as a whole represent a large mass of information.

In the case of mobile applications, millions of users share their thoughts and criticisms on Google Play and the Apple App Store every day, expressing their feelings and feedback after using an application. Faced with this mass of data, traditional marketing studies and techniques are no longer sufficient. New techniques must be adopted to automate and optimize the analysis of user feedback.

User reviews allow a better understanding of consumption habits and of how the company's products are used. They also highlight the positive and negative points of the customer journey. They are therefore very valuable data that enrich the company's knowledge of its current and future customers.

This project consists of 3 distinct parts:

  • Web Scraping
  • Sentiment Analysis
  • Topic Modeling

Web Scraping:

The first step of the project is to extract user data from the Google Play Store using web scraping. The objective is to extract the content of a web page in a structured way. The main benefit of web scraping is that it can harvest content that could not be copied and pasted by hand without losing the structure of the document. For this project, I wrote a Python script that scrapes user data and stores it in a structured form, namely a CSV file.

The web scraping script was written using the BeautifulSoup and Selenium modules. The extracted data is stored in the "scraped_reviews" folder. For example, I scraped 10,000 reviews from Android applications such as Instagram, Facebook, Netflix, etc.
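To illustrate the idea of turning page markup into structured records, here is a minimal sketch. It is not the project's script: it uses the standard library's `html.parser` in place of BeautifulSoup and Selenium so that it runs without extra dependencies, and the HTML snippet and its class names (`review`, `user`, `stars`, `text`) are hypothetical, not the Play Store's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names below are invented for this sketch
# and are not the Play Store's real ones.
SAMPLE_HTML = """
<div class="review">
  <span class="user">Alice</span>
  <span class="stars" data-value="5"></span>
  <p class="text">Great app, works well.</p>
</div>
<div class="review">
  <span class="user">Bob</span>
  <span class="stars" data-value="2"></span>
  <p class="text">Crashes on startup.</p>
</div>
"""


class ReviewParser(HTMLParser):
    """Collect one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # review currently being filled in
        self._field = None    # text field currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class", "")
        if tag == "div" and css_class == "review":
            self._current = {"user_name": "", "num_stars": 0, "review": ""}
            self.reviews.append(self._current)
        elif self._current is not None:
            if css_class == "user":
                self._field = "user_name"
            elif css_class == "text":
                self._field = "review"
            elif css_class == "stars":
                self._current["num_stars"] = int(attrs.get("data-value", 0))

    def handle_data(self, data):
        # Only keep text that sits inside a field we are tracking.
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None
        elif tag == "div":
            self._current = None


def parse_reviews(html):
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


reviews = parse_reviews(SAMPLE_HTML)
```

In a real scraper, the page would first have to be loaded and scrolled in a browser (the role Selenium plays here) before its HTML can be parsed.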

The CSV file contains 8 columns:

  • user_name : Username of the Google account
  • date : Date of the review
  • num_stars : Number of stars of the review (1 to 5)
  • review : Textual review of the user
  • num_likes : Number of likes the review received from other users
  • user_name_answer : Username of the author of the answer
  • date_answer : Date of the answer
  • answer : Textual content of the answer
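As a rough sketch of how a file with these columns could be read back, the example below uses Python's built-in `csv` module. The column names come from the list above; the sample rows, the comma delimiter, the date format and the helper names `load_reviews` and `average_stars` are assumptions made for illustration.

```python
import csv
import io

# Column names taken from the list above.
COLUMNS = [
    "user_name", "date", "num_stars", "review",
    "num_likes", "user_name_answer", "date_answer", "answer",
]

# Made-up rows for illustration; the delimiter and date format are assumed.
SAMPLE_CSV = (
    "user_name,date,num_stars,review,num_likes,user_name_answer,date_answer,answer\n"
    'Alice,2020-01-05,5,"Great app, works well.",12,,,\n'
    "Bob,2020-01-06,2,Crashes on startup.,3,Support,2020-01-07,Sorry about that.\n"
)


def load_reviews(handle):
    """Read review rows and convert the numeric columns to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["num_stars"] = int(row["num_stars"])
        row["num_likes"] = int(row["num_likes"])
        rows.append(row)
    return rows


def average_stars(rows):
    """Mean star rating over all rows."""
    return sum(r["num_stars"] for r in rows) / len(rows)


rows = load_reviews(io.StringIO(SAMPLE_CSV))
# Reviews without a reply have empty answer columns.
answered = [r for r in rows if r["answer"]]
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="", encoding="utf-8")`.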

Topic Modeling with Latent Dirichlet Allocation:

Topic modeling is a text mining technique that uses statistical machine learning, most commonly unsupervised, to identify themes in a corpus or a large amount of unstructured text. From a collection of documents, the model groups words into clusters, each identifying a topic, through a process based on similarity.

Latent Dirichlet Allocation (LDA) is a popular method for fitting a topic model. It treats each document as a mixture of topics and each topic as a mixture of words. This allows documents to "overlap" in terms of content, rather than being separated into distinct groups, in a way that reflects typical natural language usage.
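The "mixture" idea can be made concrete with a small simulation of the generative story that LDA assumes. This is only a sketch, not the model fitted in this project: the two topics and their word probabilities are hypothetical and fixed by hand (full LDA also draws them from a Dirichlet prior), and the function names are mine.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "performance": {"crash": 0.5, "slow": 0.3, "update": 0.2},
    "content": {"video": 0.4, "photo": 0.4, "update": 0.2},
}


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Generate one document: pick a topic mixture, then a topic per word."""
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # Each word first picks a topic according to the document's mixture...
        topic = rng.choices(names, weights=mixture)[0]
        # ...then picks a word according to that topic's word distribution.
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return dict(zip(names, mixture)), words


rng = random.Random(0)
mixture, words = generate_document(n_words=8, alpha=0.5, rng=rng)
```

Fitting LDA runs this story in reverse: given only the words of each document, it estimates the topic mixtures and the word distributions that most plausibly produced them. Note that "update" belongs to both topics, which is how documents end up overlapping.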

2-dimensional visualization of the topic model:

We can produce a two-dimensional visualization of the topics extracted from the scraped dataset. To do so, we use pyLDAvis, a Python library for interactive LDA visualization. The figure below is a screenshot of the 2D visualization of the LDA model I implemented:

The size of each circle represents the importance of the topic across the whole corpus, and the distance between the centers of the circles indicates the similarity between the topics. For each topic, the bar chart on the right side lists the 30 most relevant terms. LDA helped me extract 7 main topics.
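To give an intuition for what "distance between topics" means, here is a minimal sketch of one common dissimilarity measure between topic-word distributions, the Jensen-Shannon divergence. As far as I understand, pyLDAvis uses this measure by default before projecting the topics onto two dimensions, but treat that as an assumption; the three topic distributions below are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero terms of p are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over a shared 4-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.2, 0.7]

near = js_divergence(topic_a, topic_b)  # topics that favor the same words
far = js_divergence(topic_a, topic_c)   # topics that favor different words
```

Topics that favor the same words get a small divergence and would be drawn close together, while topics with little vocabulary in common get a larger one and would be drawn far apart.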
