Data Cleaning and Exploration
Data can be messy, especially when manual data input is involved. Data cleaning was the first task that became easier after I got started with Python.
In this project, I will leverage Python’s Pandas and NumPy libraries to clean and explore a dataset.
By creating a spider with the Scrapy module, I obtained information about all study programmes in the Netherlands from the website kiesjestudie.nl. The script for retrieving the data is available in this project: Web scraping using Scrapy spiders.
The dataset contains a lot of useful information; however, it needs to be cleaned before further analysis. The project consists of three stages:
- Cleaning the original dataset: checking for missing values, setting the right data types, separating columns
- Merging datasets: combining data from multiple sources in order to obtain more information
- Exploring and visualizing the dataset: understanding the data and viewing the most relevant features
Let’s start by importing the relevant modules and exploring the data!
import pandas as pd
import numpy as np
df = pd.read_csv('kiesjestudies_dataset.csv', encoding='latin-1', header = None)
df.head()
df.info()
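Before changing anything, it is also worth counting the missing values per column, for example:
# Number of missing values in each column
df.isnull().sum()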
First steps:
At first glance, we can already spot some issues with the dataset, such as missing column names, wrong data types and unnecessary characters. So first we need to:
- add column names
- replace quotation marks in the first column
- replace missing score values (marked as “…”) with NaN
- change the data type of the score column from string to float
- create separate columns for categorical data
column_list = ['study_programme', 'level', 'university', 'score', 'language', 'branch', 'city']
df.columns = column_list
df['study_programme'] = df['study_programme'].str.replace("\"", "")
df['score'] = df['score'].replace('...', np.nan)
df['level'].unique()
new = df["level"].str.split("(", n = 1, expand = True)
df['degree']= new[0]
df['type']= new[1]
df['type'] = df['type'].str.replace(")", "")
The type and the language columns contain categorical data. There are multiple ways to encode categorical data; in this project, I create separate indicator columns with 0 and 1 values for the different categories.
df['full_time'] = df['type'].str.contains('voltijd').astype(int)
df['part_time'] = df['type'].str.contains('deeltijd').astype(int)
df['dual'] = df['type'].str.contains('duaal').astype(int)
df['language'].unique()
df['dutch'] = df['language'].str.contains('Nederlands').astype(int)
df['english'] = df['language'].str.contains('Engels').astype(int)
df['german'] = df['language'].str.contains('Duits').astype(int)
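Note that a programme can be offered in several forms or languages at once, which is why str.contains is used above rather than pd.get_dummies. If the separator inside the cells is known, Series.str.get_dummies would build the same indicator columns in one call; a sketch, assuming the values are separated by ', ':
# Assumed alternative: one indicator column per value, e.g. from 'voltijd, deeltijd'
type_dummies = df['type'].str.get_dummies(sep=', ')
language_dummies = df['language'].str.get_dummies(sep=', ')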
df.drop(columns = ['level', 'type', 'language'], inplace = True)
df['score'] = df['score'].str.replace(',', '.')
df['score'] = df['score'].astype(float)
df.head()
df.dtypes
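As an aside, the comma replacement and the float conversion could be collapsed into a single step with pd.to_numeric, whose errors='coerce' option turns anything unparseable (including the '...' placeholders) into NaN. A sketch, applied to the raw string column instead of the steps above:
# One-step conversion: unparseable values become NaN automatically
df['score'] = pd.to_numeric(df['score'].str.replace(',', '.', regex=False), errors='coerce')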
Merging datasets
After cleaning, the dataset provides a good overview of the HBO study programmes in the Netherlands with information about the university, score, branch, degree, type and language. It can be further broadened by adding information from other datasets. I will merge the original dataset with two more datasets in order to get an even better picture of the study programmes.
1) Information about the province in which each university is located: this dataset was obtained by scraping a Wikipedia article (see script below) with the help of Python’s BeautifulSoup library.
2) Information about the number of students attending each university: this dataset was obtained from the website of Dienst Uitvoering Onderwijs (the institution in charge of student finance in the Netherlands).
Both new datasets need to be cleaned before merging them with the original.
First dataset
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://nl.wikipedia.org/wiki/Lijst_van_Nederlandse_gemeenten_per_provincie'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
# The first twelve section headlines on the page are the twelve provinces
regions = page_soup.findAll("span", {'class': 'mw-headline'})[0:12]
file = 'dutch_provinces_and_cities.csv'
f = open(file, 'w')
headers = 'province,city\n'
f.write(headers)
# For each province headline, take the ordered list of municipalities
# that follows it and write one province,city row per municipality
for region in range(len(regions)):
    province = regions[region].text.strip()
    ol = page_soup.findAll('h2')[region + 1].find_next('ol')
    cities = ol.findAll("span", {'class': 'sorttext'})
    for city in cities:
        f.write(province + ',' + city.text.strip() + '\n')
f.close()
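As a side note, writing the CSV by hand would break if a name ever contained a comma; the same rows could instead be collected into a DataFrame and saved with to_csv, which quotes values properly. A sketch using the variables defined above:
rows = []
for region in range(len(regions)):
    province = regions[region].text.strip()
    ol = page_soup.findAll('h2')[region + 1].find_next('ol')
    for city in ol.findAll("span", {'class': 'sorttext'}):
        rows.append({'province': province, 'city': city.text.strip()})
pd.DataFrame(rows).to_csv('dutch_provinces_and_cities.csv', index=False)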
provinces = pd.read_csv('dutch_provinces_and_cities.csv', encoding='latin-1')
new = pd.merge(df, provinces, on='city', how = 'left')
new.head()
missing = new[new['province'].isnull()]
missing['city'].unique()
new.loc[new.city == 'Panningen', 'province'] = "Limburg"
new.loc[new.city == 'Doorn', 'province'] = "Utrecht"
new.loc[new.city == 'Sittard', 'province'] = "Limburg"
new.loc[new.city == 'Velp', 'province'] = "Gelderland"
new.loc[new.city == 'Veghel', 'province'] = 'Noord-Brabant'
new.loc[new.city == 'Driebergen', 'province'] = "Utrecht"
new.loc[new.city == 'Den Bosch', 'province'] = 'Noord-Brabant'
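The same manual fixes can be written more compactly with a dictionary that maps each unmatched city to its province; a sketch using the pairs from above:
manual_provinces = {
    'Panningen': 'Limburg',
    'Doorn': 'Utrecht',
    'Sittard': 'Limburg',
    'Velp': 'Gelderland',
    'Veghel': 'Noord-Brabant',
    'Driebergen': 'Utrecht',
    'Den Bosch': 'Noord-Brabant',
}
mask = new['province'].isnull()
new.loc[mask, 'province'] = new.loc[mask, 'city'].map(manual_provinces)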
new['province'].unique()
new['province'] = new['province'].astype(str)
new['province'] = new['province'].str.replace(' (provincie)', '', regex=False)
new.head()
Second dataset
numbers = pd.read_excel('01-ingeschrevenen-hbo-2018.xlsx')
numbers.head()
students = numbers.groupby('INSTELLINGSNAAM ACTUEEL')[['2018 Man', '2018 Vrouw']].sum()
full = pd.merge(new, students, left_on='university', right_on = 'INSTELLINGSNAAM ACTUEEL', how = 'left')
full.head()
university_list = students.index.tolist()
# Some scraped university names are longer variants of the official names
# in the student-number file; replace each with the matching official name
for l in new['university']:
    for n in university_list:
        if n in l:
            new['university'] = new['university'].replace(l, n)
full = pd.merge(new, students, left_on='university', right_on = 'INSTELLINGSNAAM ACTUEEL', how = 'left')
full.head()
full.info()
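As with the provinces earlier, any universities that still have no student numbers can be listed for manual inspection; for example:
# Universities left without student counts after the name matching
full[full['2018 Man'].isnull()]['university'].unique()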
full['Total'] = full[['2018 Man', '2018 Vrouw']].sum(axis = 1, skipna = True)
full.rename(columns = {'2018 Man' : 'Male', '2018 Vrouw': 'Female'}, inplace = True)
full.to_csv('complete_university_dataset.csv')
Exploring the data
Data exploration helps to understand the dataset and identify the most relevant variables before the analysis. Besides cleaning, Python’s Pandas library is also useful for grouping and filtering the data. I will use three additional libraries for data visualization: Matplotlib, Seaborn and Folium.
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 6)
plt.style.use('ggplot')
import seaborn as sns
import folium
full.head()
Based on the dataset from www.kiesjestudies.nl, there are 1210 HBO study programmes in the Netherlands. In this project, I compare them along 16 dimensions.
full.shape
full.describe()
The province of Zuid-Holland has the most universities: 23 of them, offering 114 HBO study programmes. Noord-Brabant has the highest number of study programmes, offered by just 14 universities.
by_province = full.groupby('province')[['university', 'study_programme']].nunique().sort_values('university', ascending=False)
by_province
ax = by_province.plot(kind = 'bar', color = ['#ff9933', '#6ec1e4'])
for p in ax.patches:
    ax.annotate(np.round(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
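Since this annotation loop is reused for the bar charts below, it could be wrapped in a small helper function; a minimal sketch:
def annotate_bars(ax):
    # Write each bar's rounded height just above the bar
    for p in ax.patches:
        ax.annotate(np.round(p.get_height()),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center',
                    xytext=(0, 10), textcoords='offset points')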
Hogeschool van Amsterdam is the biggest university, followed by Fontys Hogescholen and Hogeschool Rotterdam, with 44853, 43524 and 38568 students respectively. Based on the number of study programmes offered, however, Hogeschool van Amsterdam is only in 9th place. Fontys Hogescholen is in first place with 107 study programmes.
full.groupby('university').agg({'study_programme' : ['count'], 'Total' : ['min']})\
.sort_values(('Total', 'min'), ascending = False).head()
university_size = full.groupby('university')[['Total']].min().sort_values('Total', ascending = False)
university_size = university_size[university_size['Total'] != 0]
ax = university_size.plot(kind = 'bar', color = '#6ec1e4')
university_size = full.groupby('university')[['study_programme']].count().sort_values('study_programme', ascending = False)
university_size = university_size[university_size['study_programme'] > 5]
ax = university_size.plot(kind = 'bar', color = '#6ec1e4')
Out of the 1210 HBO study programmes, 1055 are full-time, 563 are part-time and 167 are dual.
by_type = full[['full_time', 'part_time', 'dual']].sum()
ax = by_type.plot(kind = 'bar', color = '#6ec1e4')
for p in ax.patches:
    ax.annotate(np.round(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
Out of the 1210 HBO study programmes, 1148 are offered in Dutch, 126 are offered in English and 9 are offered in German.
by_language = full[['dutch', 'english', 'german']].sum()
ax = by_language.plot(kind = 'bar', color = '#6ec1e4')
for p in ax.patches:
    ax.annotate(np.round(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
Out of 1210 study programmes, there are 1022 with non-missing score values. The mean score is 6.8. The study programme with the highest score of 8.3 is Dans, offered by Hanzehogeschool Groningen. The study programme with the lowest score of 5.8 is Toegepaste Psychologie, offered by HBO Da Vinci Drechtsteden.
ax = sns.boxplot(x="score", data=full, color='#6ec1e4')
full.sort_values('score', ascending = False).head()
full.sort_values('score').head()
by_branch = full.groupby('branch').agg({'score' : ['mean', 'count'], 'study_programme' :'count'}).sort_values([('score', 'mean')], ascending = False)
by_branch.head(15)
by_branch = full.groupby(['study_programme', 'branch']).agg({'score': ['mean', 'count']}).sort_values([('score', 'mean')], ascending=False)
by_branch.head(15)
Location with Folium
Lastly, it is interesting to see the locations of the study programmes. Python’s Folium package helps to visualize spatial data. For this visualization, I needed the latitude and longitude of every city in the dataframe, which I obtained with Tableau. The size of the circles on the map corresponds to the number of study programmes offered in each city.
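As a side note, the coordinates could also be looked up in Python rather than Tableau, for example with the geopy library's Nominatim geocoder. A minimal sketch (the user_agent string is a placeholder, and heavy use should be rate-limited):
from geopy.geocoders import Nominatim
# Hypothetical geocoding alternative to Tableau
geolocator = Nominatim(user_agent='study-programme-map')
loc = geolocator.geocode('Groningen, Netherlands')
print(loc.latitude, loc.longitude)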
geo = pd.read_csv('Lon_Lat.csv')
geo.head()
geo['Count of Study Programme'] = geo['Count of Study Programme'].astype('float')
geo.info()
geo['Latitude (generated)'] = geo['Latitude (generated)'].str.replace(',', '.')
geo['Latitude (generated)'] = geo['Latitude (generated)'].astype(float)
geo['Longitude (generated)'] = geo['Longitude (generated)'].str.replace(',', '.')
geo['Longitude (generated)'] = geo['Longitude (generated)'].astype(float)
geo.dropna(inplace = True)
location = folium.Map([geo['Latitude (generated)'].mean(),geo['Longitude (generated)'].mean()], tiles='Mapbox Bright', zoom_start=7)
for i in range(0, len(geo)):
    folium.Circle(
        location=[geo.iloc[i]['Latitude (generated)'], geo.iloc[i]['Longitude (generated)']],
        popup=geo.iloc[i]['City'],
        radius=geo.iloc[i]['Count of Study Programme'] * 100,  # folium.Circle radius is in metres
        color='crimson',
        fill=True,
        fill_color='crimson'
    ).add_to(location)
location.save('mymap.html')
location