Analyzing personal Web activity

Haziq Sayyed
4 min read · May 30, 2021

Recently, due to COVID-19, most of my work has been on a computer, and I wanted to see how this has affected my personal health. I've been wanting to do this for a long time but didn't have the expertise for it. However, I recently took a course called Zero to Pandas, offered by Jovian, so I finally decided to analyze my personal data. I'm going to try to make this a tutorial so you can do the same with your own data from Google.

To get your own data, go to Google Takeout and export your browser history; then you can follow along.

Getting History Data

The first step of any data analysis project is to clean the data and separate the useful parts from the rest. For this demo, we will use pandas to load the JSON file into our program.

import json
import pandas as pd

with open("BrowserHistory.json") as f:
    data = json.loads(f.read())
df = pd.DataFrame(data["Browser History"])

Let’s take a quick look at the data frame.
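If you want to inspect it in code rather than from a screenshot, a quick df.head() does the trick. The exact columns depend on your export, but mine included at least page_transition, url, time_usec, client_id and favicon_url.

# Peek at the first few rows and the available columns
df.head()
df.columns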

Let's get rid of some of these columns, like favicon_url and client_id, as we don't need them in our analysis.

df.drop('client_id', axis=1, inplace=True)
df.drop('favicon_url', axis=1, inplace=True)

Now let’s check the unique values of page_transition, as it seems to contain the type of each page visit.
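A one-liner like this lists the distinct values (assuming the column is named page_transition, as it is in my export):

# List the distinct transition types recorded in the history
df['page_transition'].unique()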

After looking at this data, I have concluded that for this analysis I only need link, Reload, Generated and Typed. So let's get rid of all the rows except the ones mentioned.
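Here is a sketch of what that filter could look like; match the spelling and case to whatever df['page_transition'].unique() printed for you (my export used uppercase values):

# Keep only the transition types we care about
keep = ["LINK", "RELOAD", "GENERATED", "TYPED"]
df = df[df['page_transition'].isin(keep)]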

Extracting new features:

So now that we have gotten rid of all the other values in page_transition, we are ready to get some insights out of this data. But before we start visualizing, we need to convert the timestamp, which is in microseconds, into a pandas date-time object. This will make it easy for us to fetch a particular month, day or any other slice of time. Also, note that this data covers only 12 months; that is what Google provides us with.

Here’s a function that will help us convert the “time_usec” column to a readable format and add it as a separate column

import datetime
def time_converter(x):
    # time_usec is microseconds since the Unix epoch
    return datetime.datetime.fromtimestamp(x / 1000000)
df['date_time'] = df['time_usec'].apply(time_converter)

This is what it looks like.

Now, to make things a little easier, we will split the date into separate columns for the year, month, day and hour (the hour and day columns are used for the heatmap later), as sketched below.
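The original snippet for this step isn't shown, but assuming the date_time column from above, it could look something like this (I'm using the weekday name for day; day of month would work too):

# Make sure the column is a proper pandas datetime, then split it up
df['date_time'] = pd.to_datetime(df['date_time'])
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['day'] = df['date_time'].dt.day_name()   # e.g. "Sunday"
df['hour'] = df['date_time'].dt.hour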

Now it's a good idea to define a few functions that clean the data and make it a lot easier for us to work with.

Function 1: This function returns the domain, in plain text, from a link.

import tldextract

def return_domain(x):
    domain = tldextract.extract(x)[1]
    sub_domain = tldextract.extract(x)[0]
    if sub_domain == "mail":
        return sub_domain + "." + domain
    # To differentiate b/w drive.google.com and google.com
    if domain == "google" and sub_domain == "www":
        return "google_search"
    return domain

Function 2: Returns the category of a particular domain. I've separated them into Learning, News reads, social media and Other.

def return_category(x):
    if x in ["pluralsight", "w3schools", "geeks4geeks", "freecodecamp", "jovian",
             "stackoverflow", "kodekloud", "teachable", "pynative", "realpython"]:
        return "Learning"
    elif x in ["9to5google", "theverge", "sciencedaily", "digitaltrends",
               "towardsdatascience", "geekblooging"]:
        return "Newsreads"
    elif x in ["youtube", "instagram", "facebook", "twitter", "pinterest",
               "discord", "whatsapp", "snapchat"]:
        return "social media"
    else:
        return "Other"

# Cluster popular domains into a category
df['domain'] = df['url'].apply(return_domain)
df['category'] = df['domain'].apply(return_category)

Exploratory Analysis and Visualization

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

1. Most visited page transition:

plt.title("Distribution of pages")
plt.hist(df.page_transition,color='darkorange');

So you can see I mostly click on links, which most people do, so it isn't a surprise: clicking YouTube recommendations, Google search results and similar things.

2. Most active time:

df_heat = df.groupby(["hour", "day"])["url"].size().reset_index()
df_heat2 = df_heat.pivot(index="hour", columns="day", values="url")
g = sns.heatmap(df_heat2, cmap='Blues')
g.invert_yaxis()
plt.show()

This shows most of my activity was between 7 AM and 9 PM, which is true for most people, so our data does reflect the real world.

3. Category Pie chart

plt.figure(figsize = (5,5))
df['category'].value_counts().plot(kind='pie',autopct='%1.1f%%',shadow=True)
plt.show()

So it looks like most of my searches are all over the place and I can't really categorize them. Also, I'm pretty sure I get my news from a lot more websites that aren't included here.

4. Total youtube links visited

# Filter to YouTube visits in 2020 (parentheses matter for the boolean mask)
df_youtube = df[(df['domain'] == "youtube") & (df['year'] == 2020)]
# Plots
plt.figure(figsize=(14,8))
plt.title("Youtube link visits")
sns.countplot(x='month',data=df_youtube, palette=['#432371',"#FAAE7B"]);

These are the numbers of YouTube links I clicked during each month. June seems to have the least; I'm assuming I probably used the app instead of the browser.

Inferences and Conclusion

We just got a glimpse of what my searches look like; let's see what we found:

  1. Most of my web activity comes from links, which confirms Google's claims about its algorithm
  2. All my activity occurs between 7 AM and 9 PM
  3. Due to COVID-19 and working from home, my web activity has jumped up by
  4. My web activity kept increasing and never took a dip
