Utilized Python for data cleaning and analysis of social media usage data, employing libraries like Pandas and Matplotlib. Improved insights for informed decision-making and business strategies.

License: GNU Affero General Public License v3.0

Clean-and-analyze-social-media-usage-data-with-Python

Twitter Sentiment Analysis



Problem Statement

The purpose of this analysis is to gain insight into the engagement that the #NG30DaysOfLearning hashtag generated on Twitter.

For this study, we examined a variety of metrics: the number of tweets, the number of users, the most active users, the most mentioned tools, etc.


Data Sourcing

  • The dataset used for this analysis was scraped from Twitter with the snscrape library in a Jupyter Notebook
  • A preview of the dataset and the Python code is shown below:
!pip install snscrape
Requirement already satisfied: snscrape in c:\users\user\anaconda3\lib\site-packages (0.4.3.20220106)
Requirement already satisfied: lxml in c:\users\user\anaconda3\lib\site-packages (from snscrape) (4.8.0)
Requirement already satisfied: beautifulsoup4 in c:\users\user\anaconda3\lib\site-packages (from snscrape) (4.11.1)
Requirement already satisfied: requests[socks] in c:\users\user\anaconda3\lib\site-packages (from snscrape) (2.27.1)
Requirement already satisfied: filelock in c:\users\user\anaconda3\lib\site-packages (from snscrape) (3.6.0)
Requirement already satisfied: soupsieve>1.2 in c:\users\user\anaconda3\lib\site-packages (from beautifulsoup4->snscrape) (2.3.1)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (2021.10.8)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (1.7.1)
import pandas as pd
import snscrape.modules.twitter as sntwitter

# Search query: tweets carrying either hashtag, posted between 2022-05-05 and 2022-06-26
query = "(#30DaysOfLearning OR #NG30DaysOfLearning) until:2022-06-26 since:2022-05-05"
tweets = []
limit = 30000

# TwitterSearchScraper accepts full search operators (since:/until:/OR);
# TwitterHashtagScraper only takes a bare hashtag, so it cannot run this query.
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    tweets.append([tweet.date, tweet.url, tweet.user.username, tweet.sourceLabel,
                   tweet.user.location, tweet.content, tweet.likeCount,
                   tweet.retweetCount, tweet.quoteCount, tweet.replyCount])

df = pd.DataFrame(tweets, columns=['Date', 'TweetURL', 'User', 'Source', 'Location',
                                   'Tweet', 'Likes_Count', 'Retweet_Count',
                                   'Quote_Count', 'Reply_Count'])

df.to_csv('30DLTweets.csv', index=False)
df.head()
Date TweetURL User Source Location Tweet Likes_Count Retweet_Count Quote_Count Reply_Count
0 2022-06-25 22:51:18+00:00 https://twitter.com/poetrineer/status/15408300... poetrineer Twitter for Android Oyo, Nigeria So as one of my commitment to document my lear... 0 0 0 1
1 2022-06-25 22:44:10+00:00 https://twitter.com/poetrineer/status/15408282... poetrineer Twitter for Android Oyo, Nigeria Finally, here is my updated COVID-19 Data Anal... 1 0 0 0
2 2022-06-25 19:25:58+00:00 https://twitter.com/MichealOjuri/status/154077... MichealOjuri Twitter Web App Oyo, Nigeria #30NGDaysOfLearning\n#30daysoflearning \n#micr... 0 0 0 0
3 2022-06-25 16:44:36+00:00 https://twitter.com/oye__aashu/status/15407377... oye__aashu Twitter for Android Nainital, India Day 4/ #30daysoflearning learned all about arr... 3 0 0 1
4 2022-06-25 12:49:02+00:00 https://twitter.com/hsb_data/status/1540678455... hsb_data Twitter Web App New Jersey Learning about sub queries on @DataCamp (SQL) ... 1 0 0 0
df.describe()
Likes_Count Retweet_Count Quote_Count Reply_Count
count 683.000000 683.000000 683.000000 683.000000
mean 15.780381 3.812592 0.185944 1.166911
std 41.164555 12.665561 0.737938 2.626554
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 0.000000 0.000000 0.000000
50% 2.000000 0.000000 0.000000 0.000000
75% 8.000000 2.000000 0.000000 1.000000
max 549.000000 248.000000 9.000000 29.000000

The dataset is also available at 30DLTweets


Data Preparation

Data transformation was done in Power Query and the dataset was loaded into Microsoft Power BI Desktop for modeling.

The Twitter Sentiment dataset is given by a table named:

  • 30DLTweets which has 10 columns and 680 rows of observation

The tabulation below shows the 30DLTweets table with its column names and their description:

Column Name     Description
Date            The date and time of the tweet
TweetURL        The URL of the tweet
User            The username of the tweet's author
Source          The device/app the user tweeted from
Location        The location of the user
Tweet           The content of the tweet
Likes_Count     The number of likes on the tweet
Retweet_Count   The number of retweets of the tweet
Quote_Count     The number of quote tweets of the tweet
Reply_Count     The number of replies to the tweet

Data cleaning for the dataset was done in Power Query as follows:

  • The 30DLTweets table was split into one dimension table and two fact tables, named:
  1. UserProfile
  2. TweetStats
  3. TweetProfile
  • A calculated column UserID was created in each of the tables using the formula UserID = [User] & "_" & [Source]
  • Unnecessary columns were removed from each of the tables
  • Each column in the tables was validated to have the correct data type
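For comparison, the UserID calculated column can be sketched in pandas. This is a hypothetical illustration, not part of the original Power Query workflow; the sample rows are assumed values shaped like the scraped data:

```python
import pandas as pd

# Assumed sample rows with the User and Source columns from the dataset
df = pd.DataFrame({
    "User": ["poetrineer", "hsb_data"],
    "Source": ["Twitter for Android", "Twitter Web App"],
})

# Equivalent of the formula: UserID = [User] & "_" & [Source]
df["UserID"] = df["User"] + "_" + df["Source"]

print(df["UserID"].tolist())
# → ['poetrineer_Twitter for Android', 'hsb_data_Twitter Web App']
```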

To ensure the accuracy of the dates in the Date column of TweetStats and TweetProfile tables, a date table was created for referencing using the M-formula:

{Number.From(List.Min(TweetProfile[Date]))..Number.From(List.Max(TweetProfile[Date]))}

Here is a breakdown of what the formula does: List.Min and List.Max return the earliest and latest dates in the TweetProfile[Date] column, Number.From converts those dates to serial numbers, and the {a..b} list syntax generates a continuous sequence spanning that range. The date table therefore covers the earliest to the latest date in the data: May 9, 2022 to June 25, 2022.
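The same continuous date range can be sketched in Python for readers more comfortable with pandas than Power Query M. The sample dates below are assumed stand-ins for TweetProfile[Date]:

```python
import pandas as pd

# Assumed sample tweet dates standing in for TweetProfile[Date]
dates = pd.to_datetime(["2022-05-09", "2022-06-01", "2022-06-25"])

# Build a continuous daily calendar from the earliest to the latest date,
# analogous to {Number.From(List.Min(...))..Number.From(List.Max(...))}
calendar = pd.date_range(dates.min(), dates.max(), freq="D")

print(len(calendar))  # → 48 days, May 9 through June 25, 2022, inclusive
```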

A Day Name column was inserted into the date table and renamed to DayOfTheWeek.

The date table was named Calender.


Data Modeling

After the dataset was cleaned and transformed, it was ready to be modeled.

  • The Calender table was marked as the official date table in the dataset.
  • A many-to-one (*:1) relationship was created between the TweetStats and Calender tables using the Date column in each table
  • A many-to-one (*:1) relationship was created between the TweetProfile and Calender tables using the Date column in each table
  • A one-to-many (1:*) relationship was created between the UserProfile and TweetStats tables using the UserID column in each table
  • A one-to-many (1:*) relationship was created between the UserProfile and TweetProfile tables using the UserID column in each table
  • The relationships formed in the data model make up a star schema, shown below:

Data Model


Data Visualization

Data visualization for the dataset was done in three parts using Microsoft Power BI Desktop:

  • Content Analysis: shows the tools by mention, a word cloud, the top active users, etc.
  • Summary: shows the total number of tweets, the total number of users, tweets by day of the week, etc.
  • Dashboard: combines visualizations from Content Analysis and Summary to answer the Problem Statement.

Figure 1 shows visualizations from Content Analysis page

Figure 1
image

Figure 2 shows visualizations from Summary page

Figure 2
image

Figure 3 shows visualizations from Dashboard page

Figure 3
image

Data Analysis

Measures used in visualization are:

  • Total No. of Tweets = COUNT(TweetProfile[Tweet])
  • Total No. of Users = COUNT(UserProfile[user])
  • Total No. of Sources = COUNT(UserProfile[Source])
  • Average Engagement Rate = DIVIDE(Total No. of Users, Total No. of Tweets)
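The same aggregates the DAX measures compute can be sketched in pandas. This is a hypothetical illustration with assumed sample rows, not the real 30DLTweets data; note that the reported figure of 234 users suggests a distinct count, so the sketch uses nunique:

```python
import pandas as pd

# Assumed sample rows shaped like the TweetProfile fact table
tweet_profile = pd.DataFrame({
    "UserID": ["a_Web", "a_Web", "b_Android", "c_Web"],
    "Tweet": ["t1", "t2", "t3", "t4"],
})

total_tweets = int(tweet_profile["Tweet"].count())    # analogous to COUNT(TweetProfile[Tweet])
total_users = int(tweet_profile["UserID"].nunique())  # distinct users
engagement_rate = total_users / total_tweets          # analogous to DIVIDE(users, tweets)

print(total_tweets, total_users, f"{engagement_rate:.0%}")  # → 4 3 75%
```

On the real data this yields 234 / 680 ≈ 34%, matching the report.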

As shown in Data Visualization, it can be deduced that:

  • There were 680 tweets with the #NG30DaysOfLearning hashtag
  • There were about 234 users
  • The average engagement rate is 34%

Insights

As shown in Data Visualization, it can be deduced that:

  • The most active user of the #NG30DaysOfLearning hashtag is theoyinbooke, with 110 tweets
  • The most mentioned tool is GitHub

Shareable Link

You can interact with the report here:

View Report
