1. Project Overview¶
🌐 Project Website (GitHub Pages)
Video Popularity and Engagement Trends on TikTok and YouTube¶
Motivation¶
In today’s digital age, short-form video platforms like TikTok and YouTube are not just entertainment hubs — they are powerful drivers of brand awareness, cultural influence, and algorithmic discovery. Viral content can reach millions within hours, driven by measurable engagement signals such as likes, views, comments, and shares.
Understanding these engagement trends is a crucial challenge in digital marketing, content strategy, and platform engineering. From a data science perspective, identifying what drives user interaction can:
- Help marketers optimize strategies for better reach and return on investment (ROI).
- Guide creators toward formats that increase viewer interaction.
- Improve recommender systems by modeling real engagement behavior.
- Reveal evolving patterns in audience preferences across platforms and regions.
This project investigates three core research questions:
- What video-level features (e.g., views, shares, comments) are most predictive of likes?
- Are there platform-specific differences in the factors that drive audience engagement?
- Can we use regression or statistical models to accurately predict popularity outcomes?
Data Sources¶
To explore these questions, I use two real-world datasets sourced from Kaggle:
- TikTok Trending Videos: Dataset link, including video length, likes, comments, shares, and play counts.
- YouTube Trending Videos: Dataset link, including views, likes, comments, titles, and upload metadata.
These datasets enable cross-platform comparison and support a range of data science techniques from cleaning and wrangling to statistical testing and modeling.
I store and process data using Google Drive + Google Colab, and publish visualizations and results via GitHub Pages at NoMooncake.github.io/YueWu3160.
2. Selected Datasets¶
TikTok & YouTube Trending Video Datasets¶
Both datasets are stored in Google Drive and accessed via Colab:
- TikTok Dataset: /content/drive/MyDrive/Colab Notebooks/Data Science/Final/Tiktok/
- YouTube Dataset: /content/drive/MyDrive/Colab Notebooks/Data Science/Final/Youtube/
Research Questions¶
- What types of videos (e.g., duration, metadata, title format) tend to go viral on TikTok and YouTube?
- What are the key factors influencing likes, comments, and shares on both platforms?
- Can we build predictive models to estimate engagement based on video attributes?
3. Connecting Google Drive and Loading Data¶
from google.colab import drive
import pandas as pd
# Mount Google Drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Load TikTok Dataset
tiktok_path = "/content/drive/MyDrive/Colab Notebooks/Data Science/Final/Tiktok/tiktok_liked_videos.csv"
tiktok_df = pd.read_csv(tiktok_path)
# Load YouTube Dataset
youtube_path = "/content/drive/MyDrive/Colab Notebooks/Data Science/Final/Youtube/USvideos.csv"
youtube_df = pd.read_csv(youtube_path)
# Display dataset samples
tiktok_df.head(), youtube_df.head()
( user_name user_id video_id \
0 mickfleetwood 6878763511033054213 6879849755204259077
1 aarongoldyboy 6784404408140088326 6872809542338596101
2 abimhn 6757970616114070534 6863625541254155526
3 heybriajones 6609395315488342021 6862342029926616325
4 samuelgrubbs 6734403205457642502 6859886359826468101
video_desc video_time \
0 @420doggface208 had it right. Dreams and Cranb... 1601839889
1 #duet with @dmichaelburke #greenscreen #lovest... 1600200680
2 mochi ice cream or rocks from the beach? 😂 #fy... 1598062361
3 So excited I can use the Alexa app to record v... 1597763517
4 Wait for it... this took us so so long, please... 1597191763
video_length video_link n_likes \
0 20 https://www.tiktok.com/@mickfleetwood/video/68... 1800000
1 59 https://www.tiktok.com/@aarongoldyboy/video/68... 337000
2 56 https://www.tiktok.com/@abimhn/video/686362554... 168500
3 15 https://www.tiktok.com/@heybriajones/video/686... 48800
4 19 https://www.tiktok.com/@samuelgrubbs/video/685... 2800000
n_shares n_comments n_plays
0 105500 25100 9500000
1 5744 2490 1300000
2 5026 1390 913400
3 429 658 645400
4 44300 25100 15900000 ,
video_id trending_date \
0 2kyS6SvSYSE 17.14.11
1 1ZAPwfrtAFY 17.14.11
2 5qpjK5DgCt4 17.14.11
3 puqaWrEC7tY 17.14.11
4 d380meD0W0M 17.14.11
title channel_title \
0 WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat
1 The Trump Presidency: Last Week Tonight with J... LastWeekTonight
2 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso
3 Nickelback Lyrics: Real or Fake? Good Mythical Morning
4 I Dare You: GOING BALD!? nigahiga
category_id publish_time \
0 22 2017-11-13T17:13:01.000Z
1 24 2017-11-13T07:30:00.000Z
2 23 2017-11-12T19:05:24.000Z
3 24 2017-11-13T11:00:04.000Z
4 24 2017-11-12T18:01:41.000Z
tags views likes \
0 SHANtell martin 748374 57527
1 last week tonight trump presidency|"last week ... 2418783 97185
2 racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146033
3 rhett and link|"gmm"|"good mythical morning"|"... 343168 10172
4 ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095731 132235
dislikes comment_count thumbnail_link \
0 2966 15954 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg
1 6146 12703 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg
2 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg
3 666 2146 https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg
4 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg
comments_disabled ratings_disabled video_error_or_removed \
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
description
0 SHANTELL'S CHANNEL - https://www.youtube.com/s...
1 One year after the presidential election, John...
2 WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3 Today we find out if Link is a Nickelback amat...
4 I know it's been a while since we did this sho... )
print(tiktok_df.columns)
print(youtube_df.columns)
Index(['user_name', 'user_id', 'video_id', 'video_desc', 'video_time',
'video_length', 'video_link', 'n_likes', 'n_shares', 'n_comments',
'n_plays'],
dtype='object')
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
'thumbnail_link', 'comments_disabled', 'ratings_disabled',
'video_error_or_removed', 'description'],
dtype='object')
# Remove missing values
tiktok_df.dropna(inplace=True)
youtube_df.dropna(inplace=True)
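Calling dropna() on every column removes a row even when the only missing value sits in a column the analysis never uses. The sketch below shows one way to check this before dropping. It uses a small hypothetical frame (`demo_df`) rather than the real datasets, and the choice of "needed" columns is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded dataset; the real frames come from pd.read_csv.
demo_df = pd.DataFrame({
    "video_id": ["a", "b", "c", "d"],
    "likes": [10, 20, 30, 40],
    "description": ["first", np.nan, "third", np.nan],  # missing values only here
})

# Count missing values per column before deciding what to drop.
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# Dropping on every column removes rows even if only an unused column is empty.
rows_all = len(demo_df.dropna())

# Restricting the check to the columns the analysis needs keeps more rows.
rows_subset = len(demo_df.dropna(subset=["video_id", "likes"]))

print(f"rows kept (all columns): {rows_all}")
print(f"rows kept (needed columns only): {rows_subset}")
```

Comparing the two row counts makes it clear how much data a blanket dropna() costs.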
# Convert timestamps (if available)
# The TikTok data has no 'upload_date' column; 'video_time' appears to hold Unix seconds
if 'video_time' in tiktok_df.columns:
    tiktok_df['video_time'] = pd.to_datetime(tiktok_df['video_time'], unit='s')
if 'publish_time' in youtube_df.columns:
    youtube_df['publish_time'] = pd.to_datetime(youtube_df['publish_time'])
# Select key columns (.copy() avoids SettingWithCopyWarning when columns are added later)
youtube_df = youtube_df[['video_id', 'title', 'views', 'likes', 'comment_count']].copy()
tiktok_df = tiktok_df[['video_id', 'n_likes', 'n_shares', 'n_plays', 'n_comments']].copy()
# Display cleaned data
tiktok_df.head(), youtube_df.head()
( video_id n_likes n_shares n_plays n_comments
0 6879849755204259077 1800000 105500 9500000 25100
1 6872809542338596101 337000 5744 1300000 2490
2 6863625541254155526 168500 5026 913400 1390
3 6862342029926616325 48800 429 645400 658
4 6859886359826468101 2800000 44300 15900000 25100,
video_id title views \
0 2kyS6SvSYSE WE WANT TO TALK ABOUT OUR MARRIAGE 748374
1 1ZAPwfrtAFY The Trump Presidency: Last Week Tonight with J... 2418783
2 5qpjK5DgCt4 Racist Superman | Rudy Mancuso, King Bach & Le... 3191434
3 puqaWrEC7tY Nickelback Lyrics: Real or Fake? 343168
4 d380meD0W0M I Dare You: GOING BALD!? 2095731
likes comment_count
0 57527 15954
1 97185 12703
2 146033 8181
3 10172 2146
4 132235 17518 )
print("TikTok Dataset Statistics:\n", tiktok_df.describe())
print("\nYouTube Dataset Statistics:\n", youtube_df.describe())
TikTok Dataset Statistics:
video_id n_likes n_shares n_plays n_comments
count 1.000000e+01 1.000000e+01 10.000000 1.000000e+01 10.000000
mean 6.853747e+18 9.328500e+05 20154.000000 7.083480e+06 7946.500000
std 1.784187e+16 1.086163e+06 33028.166828 6.746645e+06 10047.732428
min 6.820529e+18 1.770000e+04 323.000000 1.760000e+05 658.000000
25% 6.843423e+18 1.886000e+05 3410.750000 1.010050e+06 941.500000
50% 6.855444e+18 4.284500e+05 5385.000000 5.450000e+06 2582.500000
75% 6.863305e+18 1.521925e+06 20907.250000 1.377500e+07 12824.000000
max 6.879850e+18 2.800000e+06 105500.000000 1.630000e+07 25100.000000
YouTube Dataset Statistics:
views likes comment_count
count 4.037900e+04 4.037900e+04 4.037900e+04
mean 2.370839e+06 7.461677e+04 8.287396e+03
std 7.428515e+06 2.296172e+05 3.431605e+04
min 5.490000e+02 0.000000e+00 0.000000e+00
25% 2.470760e+05 5.621500e+03 6.230000e+02
50% 6.895970e+05 1.837100e+04 1.874000e+03
75% 1.834912e+06 5.567200e+04 5.787000e+03
max 2.252119e+08 5.613827e+06 1.228655e+06
Data Visualization¶
The scatter plots in this section explore the relationship between views and likes for TikTok and YouTube videos.
print(youtube_df.columns)
print(youtube_df.dtypes)
Index(['video_id', 'title', 'views', 'likes', 'comment_count'], dtype='object')
video_id         object
title            object
views             int64
likes             int64
comment_count     int64
dtype: object
youtube_df.columns = youtube_df.columns.str.strip()
print("views" in youtube_df.columns)
True
youtube_df = youtube_df.reset_index(drop=True)
print(youtube_df[['views', 'likes']].head())
     views   likes
0   748374   57527
1  2418783   97185
2  3191434  146033
3   343168   10172
4  2095731  132235
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
plt.scatter(youtube_df['views'].values, youtube_df['likes'].values, alpha=0.5, color='red', label="YouTube Videos")
plt.xlabel("Views")
plt.ylabel("Likes")
plt.title("YouTube Views vs. Likes")
plt.legend()
plt.show()
This scatter plot displays how likes are distributed relative to view counts on YouTube. It is useful for comparing engagement strength across popular videos.
plt.figure(figsize=(8,5))
plt.scatter(tiktok_df['n_plays'].values, tiktok_df['n_likes'].values, alpha=0.5, color='blue', label="TikTok Videos")
plt.xlabel("Views")
plt.ylabel("Likes")
plt.title("TikTok Views vs. Likes")
plt.legend()
plt.show()
This scatter plot shows the relationship between views and likes on TikTok videos. It helps identify whether higher view counts generally correspond to higher likes.
plt.figure(figsize=(10, 5))
plt.hist(youtube_df['likes'], bins=50, alpha=0.5, label='YouTube', color='red', log=True)
plt.hist(tiktok_df['n_likes'], bins=50, alpha=0.8, label='TikTok', color='blue', edgecolor='black', log=True)
plt.xlabel("Likes")
plt.ylabel("Frequency (Log Scale)")
plt.title("Log-Scale Distribution of Likes on TikTok vs YouTube")
plt.legend()
plt.show()
This histogram compares the distribution of likes on TikTok and YouTube. It reveals whether user engagement is more concentrated on one platform or evenly spread.
tiktok_df['like_rate'] = tiktok_df['n_likes'] / tiktok_df['n_plays']
youtube_df['like_rate'] = youtube_df['likes'] / youtube_df['views']
plt.figure(figsize=(10, 5))
# 'labels' was renamed to 'tick_labels' in Matplotlib 3.9
plt.boxplot([tiktok_df['like_rate'].dropna(), youtube_df['like_rate'].dropna()], tick_labels=["TikTok", "YouTube"])
plt.title("Like Rate Comparison (Likes / Views)")
plt.ylabel("Like Rate")
plt.grid(True)
plt.show()
This boxplot shows the distribution of like rates (likes divided by views) across platforms. It helps evaluate engagement efficiency.
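The like rate is a plain ratio, so a video with zero views would produce a division by zero. The YouTube statistics above show a minimum of 549 views, so this does not occur in the cleaned data, but a guard is a reasonable precaution for other datasets. The helper below is a hypothetical sketch (`like_rate` is not a function from the original notebook) that returns NaN instead of infinity for zero-view rows.

```python
import numpy as np
import pandas as pd

def like_rate(likes: pd.Series, views: pd.Series) -> pd.Series:
    """Return likes / views, with NaN wherever views is zero."""
    safe_views = views.replace(0, np.nan)  # zero views would otherwise give inf or NaN noise
    return likes / safe_views

# Hypothetical sample with one zero-view row.
demo = pd.DataFrame({"likes": [50, 0, 5], "views": [1000, 0, 100]})
demo["like_rate"] = like_rate(demo["likes"], demo["views"])
print(demo)
```

Rows with NaN can then be dropped explicitly, as the boxplot code already does with dropna().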
summary_table = pd.DataFrame({
"Platform": ["TikTok", "YouTube"],
"Average Likes": [tiktok_df['n_likes'].mean(), youtube_df['likes'].mean()],
"Average Views": [tiktok_df['n_plays'].mean(), youtube_df['views'].mean()],
"Average Comments": [tiktok_df['n_comments'].mean(), youtube_df['comment_count'].mean()]
})
summary_table
| | Platform | Average Likes | Average Views | Average Comments |
|---|---|---|---|---|
| 0 | TikTok | 932850.000000 | 7.083480e+06 | 7946.500000 |
| 1 | YouTube | 74616.772753 | 2.370839e+06 | 8287.396196 |
This table provides a summary of average views, likes, and comments on each platform. It offers a quick numeric comparison of engagement levels.
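Like counts are heavily skewed: the YouTube statistics above report a mean of about 74,617 likes against a median of 18,371. In such cases the median can be a more representative summary than the mean. The sketch below illustrates the gap on a small hypothetical series; the numbers are invented for the example.

```python
import pandas as pd

# Hypothetical skewed sample: one viral video dominates the mean.
likes = pd.Series([1_000, 2_000, 3_000, 4_000, 990_000], name="likes")

# Compute both summaries in one call; the result is indexed by statistic name.
summary = likes.agg(["mean", "median"])
print(summary)
```

Reporting both statistics side by side shows how strongly a few outliers pull the average.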
6. Modeling Ideas¶
Based on the exploratory data analysis conducted above, I propose the following two model directions:
📌 Model 1: Predicting Like Count Based on Video Metrics¶
Goal: Build a regression model to predict the number of likes a video receives based on other measurable factors.
- Target Variable (Y): Likes (n_likes for TikTok / likes for YouTube)
- Features (X):
  - Views
  - Comments
  - Shares (for TikTok)
  - Video Length or Tags (optional)
- Model Type: Linear Regression or Random Forest Regressor
- Purpose: Identify which features most significantly impact audience engagement and how strongly they correlate with popularity.
import seaborn as sns
fig, axes = plt.subplots(nrows=3, figsize=(6, 15))
# Likes vs Views
sns.regplot(ax=axes[0], x='n_plays', y='n_likes', data=tiktok_df, color='steelblue')
axes[0].set_title('Likes vs Views (TikTok)')
axes[0].set_xlabel('Views')
axes[0].set_ylabel('Likes')
# Likes vs Comments
sns.regplot(ax=axes[1], x='n_comments', y='n_likes', data=tiktok_df, color='orange')
axes[1].set_title('Likes vs Comments (TikTok)')
axes[1].set_xlabel('Comments')
axes[1].set_ylabel('Likes')
# Likes vs Shares
sns.regplot(ax=axes[2], x='n_shares', y='n_likes', data=tiktok_df, color='green')
axes[2].set_title('Likes vs Shares (TikTok)')
axes[2].set_xlabel('Shares')
axes[2].set_ylabel('Likes')
plt.tight_layout()
plt.show()
This group of scatter plots shows how views, comments, and shares relate to the number of likes on TikTok videos. It helps determine whether linear or nonlinear relationships exist and informs feature selection for regression modeling.
plt.figure(figsize=(6, 5))
sns.heatmap(
tiktok_df[['n_likes', 'n_comments', 'n_shares', 'n_plays']].corr(),
annot=True, cmap='coolwarm', fmt=".2f"
)
plt.title('Correlation Matrix of TikTok Video Features')
plt.show()
This heatmap visualizes the correlation between video features (views, comments, shares) and the number of likes on TikTok. Stronger correlations suggest better predictive power and are helpful for choosing model inputs.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
# 1. Select features and target
X = tiktok_df[['n_plays', 'n_comments', 'n_shares']]
y = tiktok_df['n_likes']
# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# 4. Predict on test set
y_pred = lr_model.predict(X_test)
# 5. Evaluate performance
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression MAE: {mae:.2f}")
print(f"Linear Regression R² Score: {r2:.2f}")
Linear Regression MAE: 131386.68
Linear Regression R² Score: -1.09
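This linear regression model predicts the number of likes a TikTok video receives from its plays, comments, and shares, and reports the Mean Absolute Error (MAE) and the R² score. A higher R² indicates a better fit, with 1.0 being perfect. The R² of -1.09 obtained here means the model does worse on the test set than always predicting the mean number of likes. The cleaned TikTok data contains only 10 videos, so the 80/20 split trains on 8 rows and tests on 2, which makes this estimate highly unstable.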
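To see how R² can drop below zero, it helps to compute it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the definition scikit-learn's r2_score uses. The sketch below uses a hypothetical helper `r2` and invented values, not the notebook's data.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical two-point test set, similar in size to a 20% split of 10 rows.
y_true = [100.0, 200.0]
good = r2(y_true, [110.0, 190.0])  # predictions close to the truth
bad = r2(y_true, [250.0, 50.0])    # predictions further off than the mean would be

print(f"good fit R²: {good:.2f}")
print(f"poor fit R²: {bad:.2f}")
```

Whenever the model's squared error exceeds that of the mean baseline, the ratio is greater than 1 and R² turns negative.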
X_yt = youtube_df[['views', 'comment_count']]
y_yt = youtube_df['likes']
X_yt_train, X_yt_test, y_yt_train, y_yt_test = train_test_split(X_yt, y_yt, test_size=0.2, random_state=42)
lr_model_yt = LinearRegression()
lr_model_yt.fit(X_yt_train, y_yt_train)
y_yt_pred = lr_model_yt.predict(X_yt_test)
mae_yt = mean_absolute_error(y_yt_test, y_yt_pred)
r2_yt = r2_score(y_yt_test, y_yt_pred)
print(f"YouTube Linear Regression MAE: {mae_yt:.2f}")
print(f"YouTube Linear Regression R² Score: {r2_yt:.2f}")
YouTube Linear Regression MAE: 31156.60
YouTube Linear Regression R² Score: 0.89
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred, alpha=0.8, s=50, label='TikTok', color='blue', edgecolors='white')
plt.scatter(y_yt_test, y_yt_pred, alpha=0.4, s=30, label='YouTube', color='red')
plt.plot([0, max(y.max(), y_yt.max())], [0, max(y.max(), y_yt.max())], 'k--', lw=1)
plt.xlabel("Actual Likes")
plt.ylabel("Predicted Likes")
plt.title("Regression Performance: Actual vs Predicted Likes (TikTok vs YouTube)")
plt.legend()
plt.tight_layout()
plt.show()
This chart compares the regression performance for TikTok and YouTube videos in predicting like counts.
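The closer the points are to the dashed reference line (y = x), the better the model's predictions. The reported metrics qualify what the chart can show:
- The YouTube model reaches an R² of 0.89 on its test set, so views and comment counts explain most of the variance in likes, although the MAE of about 31,000 likes shows that individual predictions can still miss by a wide margin.
- The TikTok model has an R² of -1.09, meaning it performs worse than always predicting the mean. The cleaned TikTok data contains only 10 videos, so the test set holds just 2 points, far too few to judge alignment with the diagonal.
These results do not support the idea that TikTok engagement is more predictable than YouTube engagement; a larger TikTok sample would be needed to test that. YouTube engagement may also depend on factors not modeled here, such as video length, topic, or channel history.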
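With only 10 TikTok rows in the cleaned data, a single train/test split rests on very few test points. One alternative is leave-one-out evaluation, where each row is held out once and predicted from the rest. The sketch below is an illustration with hypothetical data, written in plain NumPy rather than the scikit-learn classes used above; `loo_mae` is an invented helper name.

```python
import numpy as np

def loo_mae(X, y):
    """Leave-one-out mean absolute error for ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])  # prepend an intercept column
    errors = []
    for i in range(n):
        train = np.arange(n) != i  # boolean mask that excludes row i
        coef, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        errors.append(abs(y[i] - X1[i] @ coef))
    return float(np.mean(errors))

# Hypothetical data: an exact line y = 2x + 1, then the same line with noise.
x = np.arange(10, dtype=float).reshape(-1, 1)
y_exact = 2 * x.ravel() + 1
y_noisy = y_exact + np.tile([5.0, -5.0], 5)

mae_exact = loo_mae(x, y_exact)
mae_noisy = loo_mae(x, y_noisy)
print(f"LOO MAE (exact line): {mae_exact:.6f}")
print(f"LOO MAE (noisy line): {mae_noisy:.2f}")
```

Every row contributes one test prediction, so the error estimate uses all 10 observations instead of 2.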
📌 Model 2: Comparing Engagement Efficiency Between Platforms¶
Goal: Statistically test whether there is a significant difference in average like rate (likes / views) between TikTok and YouTube videos.
- Metric: Like Rate = Likes / Views
- Method: Independent Samples t-test (or Mann-Whitney U if distribution is skewed)
- Purpose: To investigate whether one platform generally has more engaged audiences than the other, after normalizing likes by view counts.
from scipy.stats import mannwhitneyu
# Mann-Whitney U test for like_rate between platforms
stat, p_value = mannwhitneyu(tiktok_df['like_rate'], youtube_df['like_rate'])
print(f"Mann-Whitney U Test p-value: {p_value:.2e}")
Mann-Whitney U Test p-value: 4.55e-06
The Mann-Whitney U test returned a p-value of 4.55e-06, indicating a statistically significant difference in like rates between TikTok and YouTube.
This suggests that the two platforms exhibit different engagement behaviors after normalizing by view counts. Because the TikTok sample contains only 10 videos, the size of the difference should be interpreted with caution.
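A p-value says that a difference exists but not how large it is. A common companion to the Mann-Whitney U test is the probability of superiority: the share of cross-platform pairs in which the first sample's value exceeds the second's, counting ties as half. This equals the U statistic of the first sample divided by the product of the two sample sizes. The sketch below computes it directly with NumPy; the like rates are hypothetical and `prob_superiority` is an invented helper name.

```python
import numpy as np

def prob_superiority(a, b):
    """Share of (a_i, b_j) pairs with a_i > b_j, counting ties as half."""
    a = np.asarray(a, dtype=float)[:, None]  # column vector
    b = np.asarray(b, dtype=float)[None, :]  # row vector, broadcast to all pairs
    wins = (a > b).sum()
    ties = (a == b).sum()
    return float((wins + 0.5 * ties) / (a.size * b.size))

# Hypothetical like rates for illustration only.
tiktok_rates = [0.19, 0.26, 0.18]
youtube_rates = [0.03, 0.04, 0.20, 0.02]

ps = prob_superiority(tiktok_rates, youtube_rates)
print(f"probability of superiority: {ps:.3f}")
```

A value near 0.5 would mean the two platforms overlap heavily, while values near 1 mean the first sample's like rate is almost always higher.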
# Combine like rates with labels
platform_df = pd.DataFrame({
'Platform': ['TikTok'] * len(tiktok_df) + ['YouTube'] * len(youtube_df),
'Like Rate': pd.concat([tiktok_df['like_rate'], youtube_df['like_rate']], ignore_index=True)
})
# Plot barplot with error bars
plt.figure(figsize=(8, 5))
# 'ci' is deprecated in favor of 'errorbar'; 'palette' now requires 'hue'
sns.barplot(data=platform_df, x='Platform', y='Like Rate', hue='Platform', errorbar='sd', palette='pastel', legend=False)
plt.title('Average Like Rate Comparison')
plt.ylabel('Like Rate (Likes / Views)')
plt.xlabel('Platform')
plt.tight_layout()
plt.show()
This bar plot compares the average like rate (likes divided by views) between TikTok and YouTube.
TikTok shows a significantly higher average engagement rate, which aligns with the Mann-Whitney U test result.
Real-World Applications and Limitations¶
These modeling approaches are not only academic exercises — they mirror real-world applications in areas such as:
- Content optimization: Platforms like TikTok and YouTube may use similar models to recommend content or help creators refine strategy.
- Ad targeting: Predictive engagement metrics are useful for advertisers in estimating ROI and selecting trending creators.
- Platform governance: Algorithms tuned to predict engagement can also be leveraged to detect manipulative behavior or bot-driven inflation of likes.
However, it is important to note some limitations:
- The datasets reflect a subset of videos, which may bias the analysis toward already popular content.
- Features such as sound type, hashtag usage, or video aesthetics are not captured, which limits model richness.
- The causal relationship between features and engagement is not guaranteed — correlation does not imply causation.
7. GitHub Pages Deployment¶
The analysis results will be published on GitHub Pages. The repository link is:
8. Conclusion¶
This project explored what drives video popularity and engagement on TikTok and YouTube, focusing on uncovering cross-platform differences and building predictive models based on measurable video features.
I imported and cleaned the two datasets, which after cleaning contain 40,379 YouTube videos and 10 TikTok videos, using ETL steps that included filtering, handling missing values, and transforming timestamps.
Through exploratory data analysis, I found that:
- Views and comments are positively correlated with likes across both platforms, making them strong engagement indicators.
- On TikTok, the number of shares also showed moderate correlation with likes — suggesting that social amplification is a key factor in virality.
- Higher play counts were associated with more likes. Video length was dropped during column selection, so its effect on engagement was not tested.
Using scatter plots for both platforms and a correlation heatmap for TikTok, I visualized these relationships.
A Mann-Whitney U test was conducted to compare like rate distributions between platforms. The results (p < 0.00001) suggest a statistically significant difference, with TikTok showing higher engagement efficiency (more likes per view).
I built a linear regression model for each dataset to predict like counts from views and comments, plus shares for TikTok. The YouTube model fit well (R² = 0.89, MAE of about 31,000 likes), while the TikTok model did not generalize (R² = -1.09), most likely because only 10 TikTok videos were available.
These findings answer the original research questions:
- Viral videos tend to have high views and, on TikTok, high share counts; the effect of video length was not tested.
- Views and comments are strong predictors of engagement across platforms.
- Engagement can be partially predicted using regression models on YouTube, while the TikTok sample was too small to yield a reliable model. TikTok did show stronger engagement-per-view ratios than YouTube.
In future work, these insights could inform more advanced models (e.g., classification of viral content) and guide real-world applications such as content recommendation, marketing optimization, and trend forecasting.
Looking ahead, future iterations of this project may explore:
- Incorporating time-series features (e.g., time of posting) to analyze temporal dynamics in engagement.
- Expanding the scope to include non-trending or low-engagement videos, which can help build more generalizable models.
- Using advanced models such as gradient boosting or neural networks to improve predictive accuracy.
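As a starting point for the time-of-posting idea, the sketch below derives hour and weekday features from the two timestamp formats seen in the sample rows earlier: ISO 8601 strings in the YouTube publish_time column and what appear to be Unix seconds in the TikTok video_time column. Treating video_time as Unix seconds is an assumption, and the derived column names are hypothetical.

```python
import pandas as pd

# Values copied from the sample rows shown earlier in the notebook.
demo = pd.DataFrame({
    "publish_time": ["2017-11-13T17:13:01.000Z", "2017-11-12T19:05:24.000Z"],
    "video_time": [1601839889, 1600200680],  # assumed to be Unix seconds
})

# Parse both formats into timezone-aware UTC timestamps.
demo["publish_time"] = pd.to_datetime(demo["publish_time"], utc=True)
demo["video_time"] = pd.to_datetime(demo["video_time"], unit="s", utc=True)

# Derive simple time-of-posting features (all in UTC).
demo["publish_hour"] = demo["publish_time"].dt.hour
demo["publish_weekday"] = demo["publish_time"].dt.day_name()
demo["video_hour"] = demo["video_time"].dt.hour

print(demo[["publish_hour", "publish_weekday", "video_hour"]])
```

These features are in UTC; converting to the audience's local time zone would be a further modeling choice.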
By continuing to refine both the data and methodology, we can contribute deeper insights to the growing field of social media analytics and help stakeholders better navigate the digital content ecosystem.
🔗 Further Resources¶
Here are some additional resources that support and expand upon the content of this tutorial:
Kaggle Dataset Links
Documentation
Articles on Social Media Virality