Utilized Python for data cleaning and analysis of social media usage data, employing libraries like Pandas and Matplotlib. Improved insights for informed decision-making and business strategies.

License: GNU Affero General Public License v3.0

Clean-and-analyze-social-media-usage-data-with-Python

Twitter Sentiment Analysis



Problem Statement

The purpose of this analysis is to gain insight into the engagement that the #NG30DaysOfLearning hashtag generated on Twitter.

For this study, we examined a variety of metrics: the number of tweets, the number of users, the most active users, the most mentioned tools, etc.


Data Sourcing

  • The dataset used for this analysis was scraped from Twitter with the snscrape library in a Jupyter Notebook
  • A preview of the dataset and the Python code is shown below:
!pip install snscrape
Requirement already satisfied: snscrape in c:\users\user\anaconda3\lib\site-packages (0.4.3.20220106)
Requirement already satisfied: lxml in c:\users\user\anaconda3\lib\site-packages (from snscrape) (4.8.0)
Requirement already satisfied: beautifulsoup4 in c:\users\user\anaconda3\lib\site-packages (from snscrape) (4.11.1)
Requirement already satisfied: requests[socks] in c:\users\user\anaconda3\lib\site-packages (from snscrape) (2.27.1)
Requirement already satisfied: filelock in c:\users\user\anaconda3\lib\site-packages (from snscrape) (3.6.0)
Requirement already satisfied: soupsieve>1.2 in c:\users\user\anaconda3\lib\site-packages (from beautifulsoup4->snscrape) (2.3.1)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (2021.10.8)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (1.7.1)
import pandas as pd
import snscrape.modules.twitter as sntwitter

# Search query: tweets carrying either hashtag, posted between 2022-05-05 and 2022-06-26
query = "(#30DaysOfLearning OR #NG30DaysOfLearning) until:2022-06-26 since:2022-05-05"
tweets = []
limit = 30000

# TwitterSearchScraper accepts full search operators (since:/until:/OR);
# TwitterHashtagScraper only takes a bare hashtag, so it cannot run this query.
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    tweets.append([tweet.date, tweet.url, tweet.user.username, tweet.sourceLabel,
                   tweet.user.location, tweet.content, tweet.likeCount,
                   tweet.retweetCount, tweet.quoteCount, tweet.replyCount])

df = pd.DataFrame(tweets, columns=['Date', 'TweetURL', 'User', 'Source', 'Location',
                                   'Tweet', 'Likes_Count', 'Retweet_Count',
                                   'Quote_Count', 'Reply_Count'])

df.to_csv('30DLTweets.csv', index=False)
df.head()
Date TweetURL User Source Location Tweet Likes_Count Retweet_Count Quote_Count Reply_Count
0 2022-06-25 22:51:18+00:00 https://twitter.com/poetrineer/status/15408300... poetrineer Twitter for Android Oyo, Nigeria So as one of my commitment to document my lear... 0 0 0 1
1 2022-06-25 22:44:10+00:00 https://twitter.com/poetrineer/status/15408282... poetrineer Twitter for Android Oyo, Nigeria Finally, here is my updated COVID-19 Data Anal... 1 0 0 0
2 2022-06-25 19:25:58+00:00 https://twitter.com/MichealOjuri/status/154077... MichealOjuri Twitter Web App Oyo, Nigeria #30NGDaysOfLearning\n#30daysoflearning \n#micr... 0 0 0 0
3 2022-06-25 16:44:36+00:00 https://twitter.com/oye__aashu/status/15407377... oye__aashu Twitter for Android Nainital, India Day 4/ #30daysoflearning learned all about arr... 3 0 0 1
4 2022-06-25 12:49:02+00:00 https://twitter.com/hsb_data/status/1540678455... hsb_data Twitter Web App New Jersey Learning about sub queries on @DataCamp (SQL) ... 1 0 0 0
df.describe()
Likes_Count Retweet_Count Quote_Count Reply_Count
count 683.000000 683.000000 683.000000 683.000000
mean 15.780381 3.812592 0.185944 1.166911
std 41.164555 12.665561 0.737938 2.626554
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 0.000000 0.000000 0.000000
50% 2.000000 0.000000 0.000000 0.000000
75% 8.000000 2.000000 0.000000 1.000000
max 549.000000 248.000000 9.000000 29.000000

The dataset is also available at 30DLTweets


Data Preparation

Data transformation was done in Power Query and the dataset was loaded into Microsoft Power BI Desktop for modeling.

The Twitter Sentiment dataset is given by a table named:

  • 30DLTweets which has 10 columns and 680 rows of observation

The tabulation below shows the 30DLTweets table with its column names and their description:

Column Name     Description
Date            The date and time of the tweet
TweetURL        The URL of the tweet
User            The username of the tweet's author
Source          The device/app the user tweeted from
Location        The location of the user
Tweet           The content of the tweet
Likes_Count     The number of likes on the tweet
Retweet_Count   The number of retweets of the tweet
Quote_Count     The number of quote tweets of the tweet
Reply_Count     The number of replies to the tweet

Data cleaning for the dataset was done in Power Query as follows:

  • The 30DLTweets table was split into one dimension table and two fact tables, named:
  1. UserProfile
  2. TweetStats
  3. TweetProfile
  • A calculated column UserID was created in each of the tables using the formula UserID = [User] & "_" & [Source]
  • Unnecessary columns were removed from each of the tables
  • Each column in the tables was validated to have the correct data type
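For comparison, the UserID calculated column can be sketched in pandas. This is a hypothetical illustration, not part of the original Power Query workflow; the sample rows are assumed values shaped like the scraped data:

```python
import pandas as pd

# Assumed sample rows with the User and Source columns from the dataset
df = pd.DataFrame({
    "User": ["poetrineer", "hsb_data"],
    "Source": ["Twitter for Android", "Twitter Web App"],
})

# Equivalent of the formula: UserID = [User] & "_" & [Source]
df["UserID"] = df["User"] + "_" + df["Source"]

print(df["UserID"].tolist())
# → ['poetrineer_Twitter for Android', 'hsb_data_Twitter Web App']
```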

To ensure the accuracy of the dates in the Date column of TweetStats and TweetProfile tables, a date table was created for referencing using the M-formula:

{Number.From(List.Min(TweetProfile[Date]))..Number.From(List.Max(TweetProfile[Date]))}

Here is a breakdown of what the formula does: List.Min and List.Max return the earliest and latest dates in the TweetProfile[Date] column, Number.From converts those dates to serial numbers, and the {a..b} list syntax generates a continuous sequence spanning that range. The date table therefore covers the earliest to the latest date in the data: May 9, 2022 to June 25, 2022.
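The same continuous date range can be sketched in Python for readers more comfortable with pandas than Power Query M. The sample dates below are assumed stand-ins for TweetProfile[Date]:

```python
import pandas as pd

# Assumed sample tweet dates standing in for TweetProfile[Date]
dates = pd.to_datetime(["2022-05-09", "2022-06-01", "2022-06-25"])

# Build a continuous daily calendar from the earliest to the latest date,
# analogous to {Number.From(List.Min(...))..Number.From(List.Max(...))}
calendar = pd.date_range(dates.min(), dates.max(), freq="D")

print(len(calendar))  # → 48 days, May 9 through June 25, 2022, inclusive
```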

A Day Name column was inserted into the date table and renamed to DayOfTheWeek.

The date table was named Calender.


Data Modeling

After the dataset was cleaned and transformed, it was ready to be modeled.

  • The Calender table was marked as the official date table in the dataset.
  • A many-to-one (*:1) relationship was created between the TweetStats and Calender tables using the Date column in each table
  • A many-to-one (*:1) relationship was created between the TweetProfile and Calender tables using the Date column in each table
  • A one-to-many (1:*) relationship was created between the UserProfile and TweetStats tables using the UserID column in each table
  • A one-to-many (1:*) relationship was created between the UserProfile and TweetProfile tables using the UserID column in each table
  • The relationships formed in the data model make up a star schema, shown below:

Data Model


Data Visualization

Data visualization for the dataset was done in three parts using Microsoft Power BI Desktop:

  • Content Analysis: shows the tools by mention, a word cloud, the top active users, etc.
  • Summary: shows the total number of tweets, the total number of users, tweets by day of the week, etc.
  • Dashboard: combines visualizations from Content Analysis and Summary to answer the Problem Statement.

Figure 1 shows visualizations from Content Analysis page

Figure 1
image

Figure 2 shows visualizations from Summary page

Figure 2
image

Figure 3 shows visualizations from Dashboard page

Figure 3
image

Data Analysis

Measures used in visualization are:

  • Total No. of Tweets = COUNT(TweetProfile[Tweet])
  • Total No. of Users = COUNT(UserProfile[user])
  • Total No. of Sources = COUNT(UserProfile[Source])
  • Average Engagement Rate = DIVIDE(Total No. of Users, Total No. of Tweets)
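The same aggregates the DAX measures compute can be sketched in pandas. This is a hypothetical illustration with assumed sample rows, not the real 30DLTweets data; note that the reported figure of 234 users suggests a distinct count, so the sketch uses nunique:

```python
import pandas as pd

# Assumed sample rows shaped like the TweetProfile fact table
tweet_profile = pd.DataFrame({
    "UserID": ["a_Web", "a_Web", "b_Android", "c_Web"],
    "Tweet": ["t1", "t2", "t3", "t4"],
})

total_tweets = int(tweet_profile["Tweet"].count())    # analogous to COUNT(TweetProfile[Tweet])
total_users = int(tweet_profile["UserID"].nunique())  # distinct users
engagement_rate = total_users / total_tweets          # analogous to DIVIDE(users, tweets)

print(total_tweets, total_users, f"{engagement_rate:.0%}")  # → 4 3 75%
```

On the real data this yields 234 / 680 ≈ 34%, matching the report.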

As shown in Data Visualization, it can be deduced that:

  • There were 680 tweets with the #NG30DaysOfLearning hashtag
  • There were about 234 users
  • The average engagement rate is 34%

Insights

As shown in Data Visualization, it can be deduced that:

  • The most active user of the #NG30DaysOfLearning hashtag is theoyinbooke, with 110 tweets
  • The most mentioned tool is GitHub

Shareable Link

You can interact with the report here:

View Report
