A Book Recommender System based on an Exploratory Data Analysis on 52,000 books

Language Used: Python

Libraries Used: pandas, NumPy, matplotlib, seaborn

As a reader (and someone who is very picky when it comes to books), I often struggle to decide on my next read. Of course, there are various sites that suggest books and share reviews and ratings, but it can be challenging to choose a book from text reviews alone. The key points we are looking for may not be immediately apparent, which often creates more dilemmas. For instance, if you wanted a list of fiction and fantasy books with the highest average ratings that were also liked by at least 75% of readers, how would you search for it? Or a list of self-help books published after the year 2000 that also appear on the Best Books Ever list with top scores? Or what if you are looking for something cheap as well as good, or books short enough to finish in an hour (or less)? The list goes on and on. So wouldn't it be much easier if you could simply look at different graphs and use them as a guide to choose a book according to your preferences?

Keeping these things in mind, I decided to perform an exploratory data analysis on the Goodreads 'Best Books Ever' list. The dataset I am using contains records of approximately 52,000 different books, of which around 36,000 are written in English! So, by using a variety of graphs to show the relationships between key elements (genre, rating, number of pages, BBE score, liked percentage, and publish date, among others), I will provide you with insights and help you make better and more objective decisions.

You can simply think of this as a book recommender system driven by data. As you read through, you will notice that I have also added download links to the books (the ones I could find) so you can download them directly to your device!

"Sometimes, you read a book and it fills you with this weird evangelical zeal, and you become convinced that the shattered world will never be put back together unless and until all living humans read the book."

-John Green

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib import font_manager as fm, rcParams


#Specify the default fonts
plt.rcParams['font.sans-serif'] = ['SimHei'] #Show Chinese labels
plt.rcParams['axes.unicode_minus'] = False #Solve the problem that the minus sign '-' is displayed as a square
plt.rcParams['font.family'] = 'Phetsarath OT' #Font family used for all text

 
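missing_values = ['na','NA', 'N/A', 'N/a', 'not available', 'missing', np.nan] #strings to treat as missing values
#on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in pandas 2.0
df = pd.read_csv('https://github.com/scostap/goodreads_bbe_dataset/blob/main/Best_Books_Ever_dataset/books_1.Best_Books_Ever.csv?raw=true', on_bad_lines = 'skip', na_values = missing_values)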

#fxn to remove duplicates
def remove_dup(dataframe):
    org_df = dataframe.drop_duplicates()
    return org_df

def clean_data(dataframe):
    a = remove_dup(dataframe) #using the fxn to remove duplicates
    a = a.dropna(how = 'all') #drop the records that have ALL na values
    a = a.dropna(axis=1, how='all') #drop the columns that have ALL na values
    return a

clean_df = clean_data(df)
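
To see what these cleaning steps do, here is a minimal sketch on a made-up frame. The column names and values are purely illustrative and are not taken from the dataset; the three chained calls mirror the steps in `remove_dup` and `clean_data`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one duplicate row, one all-NaN row, one all-NaN column.
toy = pd.DataFrame({
    'title':  ['A', 'A', np.nan, 'B'],
    'rating': [4.1, 4.1, np.nan, 3.9],
    'empty':  [np.nan, np.nan, np.nan, np.nan],
})

cleaned = (
    toy.drop_duplicates()          # drops the repeated 'A' row
       .dropna(how='all')          # drops the row where every value is NaN
       .dropna(axis=1, how='all')  # drops the column where every value is NaN
)

print(cleaned)
```

Rows or columns that are only partly empty are kept, which is why later steps still have to deal with missing values in individual fields.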

req_dff = clean_df[['title', 'author', 'rating', 'language', 'genres', 'pages', 'publishDate','awards', 'numRatings', 'likedPercent','setting','bbeScore', 'bbeVotes','description','price' ]]


searchfor = ['complete', 'collection', 'set', 'Collection', 'Complete', 'Set', 'Bundle', 'bundle', 'collective', 'Piano Solos', 'One Direction', 'Bible', 'bible', 'Church'] #removing collections/book bundles
solo_df = req_dff[~req_dff.title.str.contains('|'.join(searchfor))]

nextsearch = ['Anonymous', 'anonymous']
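solo_df = solo_df[~solo_df.author.str.contains('|'.join(nextsearch))].copy() #mask built from solo_df itself; .copy() so later column assignments act on an independent frame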
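
A note on chaining filters like the two above: when a boolean mask is built from one DataFrame and used to index another that has fewer rows, pandas has to realign the mask to the target's index, and the versions I am aware of flag this with a "Boolean Series key will be reindexed to match DataFrame index" UserWarning. Building the mask from the frame being filtered avoids that. Below is a minimal sketch on made-up rows; the names `full`, `subset`, and `filtered` and all values are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
full = pd.DataFrame({
    'title':  ['Collected Set', 'Book One', 'Book Two'],
    'author': ['Jane Roe', 'Anonymous', 'John Doe'],
})

# The first filter shrinks the frame, so `subset` has fewer index labels than `full`.
subset = full[~full['title'].str.contains('Set')]

# Mask built from the same frame it filters: the indexes match, so no realignment is needed.
mask = ~subset['author'].str.contains('Anonymous|anonymous')
filtered = subset[mask]

print(filtered)
```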


solo_df['publishDate'] = pd.to_datetime(solo_df['publishDate'], errors='coerce') #converting publishDate to datetime object


solo_df['pages'] = solo_df['pages'].replace('1 page', np.nan) #treating the '1 page' entries as missing
solo_df['pages'] = solo_df['pages'].astype(float) #converting pages to float dtype
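
If other non-numeric page entries were to turn up, one possible alternative (not what the code above does) is `pd.to_numeric` with `errors='coerce'`, which turns every unparseable value into NaN in one step. A small sketch on made-up values, with `pages` and `pages_num` as hypothetical names:

```python
import pandas as pd

# Made-up values: page counts stored as text, with one non-numeric entry and one missing entry.
pages = pd.Series(['320', '1 page', '45', None])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
pages_num = pd.to_numeric(pages, errors='coerce')

print(pages_num)
```

The trade-off is that coercion is silent: any unexpected text becomes NaN without telling you, whereas `astype(float)` raises an error and forces you to look at the offending value.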

solo_df = solo_df.loc[solo_df['language'] == 'English'] #only the books in English language
solo_df = solo_df.loc[solo_df['pages'] > 0] #books with at least one page
solo_df = solo_df.loc[solo_df['numRatings'] > 100] #books with more than 100 ratings
solo_df = solo_df.loc[solo_df['likedPercent'] > 0] #books with a liked percentage above zero
solo_df = solo_df.drop('language', axis = 1)#dropping the language column
# solo_df.set_index('publishDate', inplace =True) #setting date as the index

#fxn to convert string to list
def str_to_list(text): #parameter renamed so it no longer shadows the built-in str
    lst = text.split(',')
    return [x.replace("'", "").replace("[", "").replace("]","").strip(' ') for x in lst]
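
Here is a quick sketch of what this helper produces, assuming the genre strings look like a Python list literal (the sample strings below are made up). For comparison, it also shows a hypothetical alternative, `parse_genres`, that uses the standard library's `ast.literal_eval` instead of manual stripping; this is not what the notebook uses.

```python
import ast

def str_to_list(text):
    # Same approach as the helper above: split on commas, then strip quotes and brackets.
    return [x.replace("'", "").replace("[", "").replace("]", "").strip(' ')
            for x in text.split(',')]

def parse_genres(text):
    # Hypothetical alternative: let Python parse the list literal directly.
    return ast.literal_eval(text)

sample = "['Fantasy', 'Young Adult', 'Fiction']"   # assumed format, made-up value
tricky = "[\"Children's\", 'Fiction']"             # made-up value containing an apostrophe

print(str_to_list(sample))
print(parse_genres(sample))
print(str_to_list(tricky))
print(parse_genres(tricky))
```

On the plain sample both approaches agree. They differ on the second one: the manual version removes the apostrophe and leaves the double quotes in place, while `literal_eval` returns the genre unchanged. On the other hand, `literal_eval` raises an error on strings that are not valid literals, so it only fits if the column is consistently formatted.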

#fxn to create a new dataframe taking the list as genre
def b_n_t():
    #solo_df already holds the required columns in order, so rename 'genres' to 'genre' and renumber the rows
    #(DataFrame.append, used for row-by-row building, was removed in pandas 2.0)
    new_df = solo_df.rename(columns = {'genres': 'genre'}).reset_index(drop = True)
    new_df['genre'] = new_df['genre'].apply(str_to_list) #convert each genre string to a list
    return new_df

list_df = b_n_t()
list_df['author'] = list_df['author'].str.split(',').str[0]
list_df['title_by_author'] = list_df['title'] + ' BY ' + list_df['author']
list_df

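
With the genres stored as lists, a question like the one from the introduction (fiction and fantasy books, liked by at least 75% of readers, highest ratings first) becomes a short filter. The sketch below uses a made-up stand-in for `list_df` so it runs on its own; `toy_books`, `wanted`, and every value in it are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows standing in for list_df; only the columns used below are included.
toy_books = pd.DataFrame({
    'title_by_author': ['Book A BY Author X', 'Book B BY Author Y', 'Book C BY Author Z'],
    'genre': [['Fantasy', 'Fiction'], ['Fiction', 'Romance'], ['Fantasy', 'Fiction']],
    'rating': [4.5, 4.2, 3.8],
    'likedPercent': [96.0, 91.0, 70.0],
})

wanted = {'Fantasy', 'Fiction'}

# Keep rows whose genre list contains every wanted genre...
has_genres = toy_books['genre'].apply(lambda g: wanted.issubset(g))
# ...and that at least 75% of readers liked.
liked = toy_books['likedPercent'] >= 75

# Highest-rated matches first.
top = toy_books[has_genres & liked].sort_values('rating', ascending=False)

print(top[['title_by_author', 'rating', 'likedPercent']])
```

The same pattern should carry over to the other questions (publish year, price, page count) by swapping in a different condition.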