Data Cleaning and Exploration

Data can be messy, especially when manual data input is involved. Data cleaning was the first task that became easier after I got started with Python.

In this project, I will leverage Python’s Pandas and NumPy libraries to clean and explore a dataset.

By creating a spider with the Scrapy module, I obtained information about all study programmes in the Netherlands from the website kiesjestudie.nl. The script for retrieving the data is available in this project: Web scraping using Scrapy spiders.

The dataset contains many useful information, however, it needs to be cleaned before further analysis. The project contains the following three stages:

  1. Cleaning the original dataset: checking for missing values, setting the right data types, separating columns
  2. Merging datasets: combining data from multiple sources in order to obtain more information
  3. Exploring and visualizing the dataset: understanding the data and viewing the most relevant features

Let’s start by importing the relevant modules and exploring the data!

In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv('kiesjestudies_dataset.csv', encoding='latin-1', header = None)
In [3]:
df.head()
Out[3]:
0 1 2 3 4 5 6
0 “Bio-informatica” HBO bachelor (voltijd) Hogeschool Leiden 6,74 Nederlands Informatica Leiden
1 “Bio-informatica” HBO bachelor (voltijd) Hogeschool van Arnhem en Nijmegen locatie Nijm… 7,43 Nederlands Informatica Nijmegen
2 “Bestuurskunde/Overheidsmanagement” HBO bachelor (voltijd / deeltijd / duaal) De Haagse Hogeschool 6,48 Nederlands Bestuurskunde Den Haag
3 “Bestuurskunde” HBO bachelor (duaal) Hogeschool NCOI Nederlands Bestuurskunde Diverse locaties
4 “Bedrijfskundige Informatica” HBO bachelor (deeltijd) LOI Hogeschool Nederlands Informatica Diverse locaties
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1210 entries, 0 to 1209
Data columns (total 7 columns):
0    1210 non-null object
1    1210 non-null object
2    1210 non-null object
3    1210 non-null object
4    1210 non-null object
5    1210 non-null object
6    1210 non-null object
dtypes: object(7)
memory usage: 66.2+ KB

First steps:

At first glance, we can already discover some issues with the dataset such as missing column names, wrong data formats and unnecessary characters. So first we need to:

  • add column names
  • replace quotation marks in the first column
  • replace missing score values (marked as “…”) with NaN
  • change the data type of the score column from string to float
  • create multiple columns for chategorical data
In [5]:
column_list = ['study_programme', 'level', 'university', 'score', 'language', 'branch', 'city']
df.columns = column_list
In [6]:
df['study_programme'] = df['study_programme'].str.replace("\"", "")
In [7]:
df['score'].replace('...', np.NaN, inplace = True)
In [8]:
df['level'].unique()
Out[8]:
array(['HBO bachelor (voltijd)',
       'HBO bachelor (voltijd / deeltijd / duaal)',
       'HBO bachelor (duaal)', 'HBO bachelor (deeltijd)',
       'HBO bachelor (voltijd / deeltijd)',
       'HBO bachelor (voltijd / duaal)',
       'HBO bachelor (deeltijd / duaal)'], dtype=object)
In [9]:
new = df["level"].str.split("(", n = 1, expand = True) 
df['degree']= new[0] 
df['type']= new[1]
df['type'] = df['type'].str.replace(")", "")

The type and the language columns contain chategorical data. There are multiple ways to map chategorical data. In this project, I will separate columns for the different chategories with 0 and 1 values.

In [10]:
df['full_time'] = df['type'].str.contains('voltijd').astype(int)
df['part_time'] = df['type'].str.contains('deeltijd').astype(int)
df['dual'] = df['type'].str.contains('duaal').astype(int)
In [11]:
df['language'].unique()
Out[11]:
array(['Nederlands', 'Engels', 'Nederlands en Engels',
       'Nederlands en Duits', 'Nederlands, Duits en Engels',
       'Nederlands en de taal van studie'], dtype=object)
In [12]:
df['dutch'] = df['language'].str.contains('Nederlands').astype(int)
df['english'] = df['language'].str.contains('Engels').astype(int)
df['german'] = df['language'].str.contains('Duits').astype(int)
In [13]:
df.drop(columns = ['level', 'type', 'language'], inplace = True)
In [14]:
df['score'].replace('...', np.NaN, inplace = True)
df['score'] = df['score'].str.replace(',', '.')
df['score'] = df['score'].astype(float)
In [15]:
df.head()
Out[15]:
study_programme university score branch city degree full_time part_time dual dutch english german
0 Bio-informatica Hogeschool Leiden 6.74 Informatica Leiden HBO bachelor 1 0 0 1 0 0
1 Bio-informatica Hogeschool van Arnhem en Nijmegen locatie Nijm… 7.43 Informatica Nijmegen HBO bachelor 1 0 0 1 0 0
2 Bestuurskunde/Overheidsmanagement De Haagse Hogeschool 6.48 Bestuurskunde Den Haag HBO bachelor 1 1 1 1 0 0
3 Bestuurskunde Hogeschool NCOI NaN Bestuurskunde Diverse locaties HBO bachelor 0 0 1 1 0 0
4 Bedrijfskundige Informatica LOI Hogeschool NaN Informatica Diverse locaties HBO bachelor 0 1 0 1 0 0
In [16]:
df.dtypes
Out[16]:
study_programme     object
university          object
score              float64
branch              object
city                object
degree              object
full_time            int32
part_time            int32
dual                 int32
dutch                int32
english              int32
german               int32
dtype: object

Merging datasets

After cleaning, the dataset provides a good overview of the HBO study programmes in the Netherlands with information about the university, score, branch, degree, type and language. It can be further broadened by adding information from other datasets. I will merge the original dataset with two more datasets in order to get an even better picture of the study programmes.

1) Information about the province in which the university is located: the dataset with the information was obtained by scraping a WikiPedia article (see script below) with the help of Python’s BeautifulSoup library.

2) Information about the number of students attending a university: the dataset was obtained from the website of Dienst Uitvoering Onderwijs (an institution in charge of the study finances in the Netherlands).

Both new datasets need to be celaned before mergin it with the original.

First dataset

In [17]:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://nl.wikipedia.org/wiki/Lijst_van_Nederlandse_gemeenten_per_provincie'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

regions = page_soup.findAll("span", {'class': 'mw-headline'})[0:12]

file = 'dutch_provinces_and_cities.csv'

f = open(file, 'w')

headers = 'province,city\n'

f.write(headers)

for region in range(len(regions)):
    province = regions[region].span.text.strip()
    ol = page_soup.findAll('h2')[region + 1].find_next('ol')
    cities = ol.findAll("span", {'class': 'sorttext'})
    for city in cities:
        f.write(province + ',' + city.text.strip() + '\n')

f.close()
In [18]:
provinces = pd.read_csv('dutch_provinces_and_cities.csv', encoding='latin-1')
In [19]:
new = pd.merge(df, provinces, on='city', how = 'left')
In [20]:
new.head()
Out[20]:
study_programme university score branch city degree full_time part_time dual dutch english german province
0 Bio-informatica Hogeschool Leiden 6.74 Informatica Leiden HBO bachelor 1 0 0 1 0 0 Zuid-Holland
1 Bio-informatica Hogeschool van Arnhem en Nijmegen locatie Nijm… 7.43 Informatica Nijmegen HBO bachelor 1 0 0 1 0 0 Gelderland
2 Bestuurskunde/Overheidsmanagement De Haagse Hogeschool 6.48 Bestuurskunde Den Haag HBO bachelor 1 1 1 1 0 0 Zuid-Holland
3 Bestuurskunde Hogeschool NCOI NaN Bestuurskunde Diverse locaties HBO bachelor 0 0 1 1 0 0 NaN
4 Bedrijfskundige Informatica LOI Hogeschool NaN Informatica Diverse locaties HBO bachelor 0 1 0 1 0 0 NaN
In [21]:
missing = new[new['province'].isnull()]
missing['city'].unique()
Out[21]:
array(['Diverse locaties', 'Panningen', 'Den Bosch', 'Doorn', 'Sittard',
       'Velp', 'Veghel', 'Driebergen'], dtype=object)
In [22]:
new.loc[new.city == 'Panningen', 'province'] = "Limburg"
new.loc[new.city == 'Doorn', 'province'] = "Utrecht"
new.loc[new.city == 'Sittard', 'province'] = "Limburg"
new.loc[new.city == 'Velp', 'province'] = "Limburg"
new.loc[new.city == 'Sittard', 'province'] = "Gelderland"
new.loc[new.city == 'Veghel', 'province'] = 'Noord-Brabant'
new.loc[new.city == 'Driebergen', 'province'] = "Utrecht"
new.loc[new.city == 'Den Bosch', 'province'] = 'Noord-Brabant'
In [23]:
new['province'].unique()
Out[23]:
array(['Zuid-Holland', 'Gelderland', nan, 'Limburg', 'Friesland',
       'Flevoland', 'Noord-Holland', 'Noord-Brabant', 'Overijssel',
       'Utrecht', 'Zeeland (provincie)', 'Utrecht (provincie)',
       'Groningen (provincie)', 'Drenthe'], dtype=object)
In [24]:
new['province'] = new['province'].astype(str)
new['province'] = new.province.apply(lambda x: x.replace(' (provincie)',''))
In [25]:
new.head()
Out[25]:
study_programme university score branch city degree full_time part_time dual dutch english german province
0 Bio-informatica Hogeschool Leiden 6.74 Informatica Leiden HBO bachelor 1 0 0 1 0 0 Zuid-Holland
1 Bio-informatica Hogeschool van Arnhem en Nijmegen locatie Nijm… 7.43 Informatica Nijmegen HBO bachelor 1 0 0 1 0 0 Gelderland
2 Bestuurskunde/Overheidsmanagement De Haagse Hogeschool 6.48 Bestuurskunde Den Haag HBO bachelor 1 1 1 1 0 0 Zuid-Holland
3 Bestuurskunde Hogeschool NCOI NaN Bestuurskunde Diverse locaties HBO bachelor 0 0 1 1 0 0 nan
4 Bedrijfskundige Informatica LOI Hogeschool NaN Informatica Diverse locaties HBO bachelor 0 1 0 1 0 0 nan

Second dataset

In [26]:
numbers = pd.read_excel('01-ingeschrevenen-hbo-2018.xlsx')
In [27]:
numbers.head()
Out[27]:
PROVINCIE GEMEENTENUMMER GEMEENTENAAM SOORT INSTELLING TYPE HOGER ONDERWIJS BRIN NUMMER ACTUEEL INSTELLINGSNAAM ACTUEEL CROHO ONDERDEEL CROHO SUBONDERDEEL OPLEIDINGSCODE ACTUEEL 2014 Man 2014 Vrouw 2015 Man 2015 Vrouw 2016 Man 2016 Vrouw 2017 Man 2017 Vrouw 2018 Man 2018 Vrouw
0 Friesland 80.0 LEEUWARDEN reguliere inst. associate degree 31FR NHL Stenden Hogeschool economie n.v.t. (economie) 80009.0 5.0 22.0 4.0 20.0 3.0 26.0 16.0 60.0 12.0 69.0
1 Friesland 80.0 LEEUWARDEN reguliere inst. associate degree 31FR NHL Stenden Hogeschool economie n.v.t. (economie) 80010.0 114.0 56.0 108.0 53.0 93.0 45.0 119.0 46.0 136.0 61.0
2 Friesland 80.0 LEEUWARDEN reguliere inst. associate degree 31FR NHL Stenden Hogeschool economie n.v.t. (economie) 80010.0 30.0 31.0 31.0 29.0 46.0 18.0 39.0 25.0 38.0 18.0
3 Friesland 80.0 LEEUWARDEN reguliere inst. associate degree 31FR NHL Stenden Hogeschool economie n.v.t. (economie) 80040.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 2.0 5.0
4 Friesland 80.0 LEEUWARDEN reguliere inst. associate degree 31FR NHL Stenden Hogeschool economie n.v.t. (economie) 80040.0 1.0 2.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0

5 rows Ă— 23 columns

In [28]:
students = numbers.groupby('INSTELLINGSNAAM ACTUEEL')['2018 Man', '2018 Vrouw'].sum()
In [29]:
full = pd.merge(new, students, left_on='university', right_on = 'INSTELLINGSNAAM ACTUEEL', how = 'left')
full.head()
Out[29]:
study_programme university score branch city degree full_time part_time dual dutch english german province 2018 Man 2018 Vrouw
0 Bio-informatica Hogeschool Leiden 6.74 Informatica Leiden HBO bachelor 1 0 0 1 0 0 Zuid-Holland 3778.0 6995.0
1 Bio-informatica Hogeschool van Arnhem en Nijmegen locatie Nijm… 7.43 Informatica Nijmegen HBO bachelor 1 0 0 1 0 0 Gelderland NaN NaN
2 Bestuurskunde/Overheidsmanagement De Haagse Hogeschool 6.48 Bestuurskunde Den Haag HBO bachelor 1 1 1 1 0 0 Zuid-Holland 13545.0 12026.0
3 Bestuurskunde Hogeschool NCOI NaN Bestuurskunde Diverse locaties HBO bachelor 0 0 1 1 0 0 nan NaN NaN
4 Bedrijfskundige Informatica LOI Hogeschool NaN Informatica Diverse locaties HBO bachelor 0 1 0 1 0 0 nan NaN NaN
In [30]:
university_list = students.index.tolist()

for l in new['university']:
    for n in university_list:
        if n in l:
            new['university'].replace(l, n, inplace = True)
In [31]:
full = pd.merge(new, students, left_on='university', right_on = 'INSTELLINGSNAAM ACTUEEL', how = 'left')
full.head()
Out[31]:
study_programme university score branch city degree full_time part_time dual dutch english german province 2018 Man 2018 Vrouw
0 Bio-informatica Hogeschool Leiden 6.74 Informatica Leiden HBO bachelor 1 0 0 1 0 0 Zuid-Holland 3778.0 6995.0
1 Bio-informatica Hogeschool van Arnhem en Nijmegen 7.43 Informatica Nijmegen HBO bachelor 1 0 0 1 0 0 Gelderland 17577.0 17499.0
2 Bestuurskunde/Overheidsmanagement De Haagse Hogeschool 6.48 Bestuurskunde Den Haag HBO bachelor 1 1 1 1 0 0 Zuid-Holland 13545.0 12026.0
3 Bestuurskunde Hogeschool NCOI NaN Bestuurskunde Diverse locaties HBO bachelor 0 0 1 1 0 0 nan NaN NaN
4 Bedrijfskundige Informatica LOI Hogeschool NaN Informatica Diverse locaties HBO bachelor 0 1 0 1 0 0 nan NaN NaN
In [32]:
full.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1210 entries, 0 to 1209
Data columns (total 15 columns):
study_programme    1210 non-null object
university         1210 non-null object
score              1022 non-null float64
branch             1210 non-null object
city               1210 non-null object
degree             1210 non-null object
full_time          1210 non-null int32
part_time          1210 non-null int32
dual               1210 non-null int32
dutch              1210 non-null int32
english            1210 non-null int32
german             1210 non-null int32
province           1210 non-null object
2018 Man           903 non-null float64
2018 Vrouw         903 non-null float64
dtypes: float64(3), int32(6), object(6)
memory usage: 122.9+ KB
In [33]:
full['Total'] = full[['2018 Man', '2018 Vrouw']].sum(axis = 1, skipna = True)
In [34]:
full.rename(columns = {'2018 Man' : 'Male', '2018 Vrouw': 'Female'}, inplace = True)
In [35]:
full.to_csv('complete_university_dataset.csv')

Exploring the data

Data exploration helps to understand the dataset and identify the most relevant variables before the analysis. Besides cleaning the data, Python’s Pandas library is also useful for cleaning and filtering. I will use three additional libraries for data visualization: Matplotlib, Seaborn and Folium.

In [43]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 6)
plt.style.use('ggplot')
import seaborn as sns
import folium
In [37]:
full.head()
Out[37]:
study_programme university score branch city degree full_time part_time dual dutch english german province Male Female Total
0 Bio-informatica Hogeschool Leiden 6.74 Informatica Leiden HBO bachelor 1 0 0 1 0 0 Zuid-Holland 3778.0 6995.0 10773.0
1 Bio-informatica Hogeschool van Arnhem en Nijmegen 7.43 Informatica Nijmegen HBO bachelor 1 0 0 1 0 0 Gelderland 17577.0 17499.0 35076.0
2 Bestuurskunde/Overheidsmanagement De Haagse Hogeschool 6.48 Bestuurskunde Den Haag HBO bachelor 1 1 1 1 0 0 Zuid-Holland 13545.0 12026.0 25571.0
3 Bestuurskunde Hogeschool NCOI NaN Bestuurskunde Diverse locaties HBO bachelor 0 0 1 1 0 0 nan NaN NaN 0.0
4 Bedrijfskundige Informatica LOI Hogeschool NaN Informatica Diverse locaties HBO bachelor 0 1 0 1 0 0 nan NaN NaN 0.0

Based on the dataset from www.kiesjestudies.nl, there are 1210 HB study programmes in the Netherlands. In this project, I compare them along 16 dimensions.

In [38]:
full.shape
Out[38]:
(1210, 16)
In [39]:
full.describe()
Out[39]:
score full_time part_time dual dutch english german Male Female Total
count 1022.000000 1210.000000 1210.000000 1210.000000 1210.000000 1210.000000 1210.000000 903.000000 903.000000 1210.000000
mean 6.842476 0.871901 0.465289 0.138017 0.948760 0.104132 0.007438 14083.895903 14067.254707 21008.668595
std 0.326294 0.334338 0.499000 0.345060 0.220577 0.305558 0.085958 6471.749742 5934.682852 16161.830588
min 5.830000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 118.000000 274.000000 0.000000
25% 6.630000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 10209.000000 10875.000000 0.000000
50% 6.810000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 14594.000000 14431.000000 24640.000000
75% 7.020000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 19101.500000 19065.000000 34211.000000
max 8.300000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 22530.000000 22323.000000 44853.000000

Zuid-Holland province has the most universities: 23 universities offer 114 HBO study programmes. Noord-Brabant has the highest number of study programmes offered by 14 universities.

In [40]:
by_province = full.groupby('province')['university', 'study_programme'].nunique().sort_values('university', ascending = False)
by_province
Out[40]:
university study_programme
province
Zuid-Holland 23 114
Noord-Holland 21 110
Utrecht 17 86
Noord-Brabant 14 115
Gelderland 13 87
Limburg 8 53
Overijssel 6 92
Friesland 5 70
Groningen 5 50
nan 5 63
Drenthe 4 11
Flevoland 2 22
Zeeland 1 22
In [44]:
ax = by_province.plot(kind = 'bar', color = ['#ff9933', '#6ec1e4'])

for p in ax.patches: 
    ax.annotate(np.round(p.get_height()), (p.get_x()+p.get_width()/2., p.get_height()), \
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

Hogeschool van Amsterdam is the biggest university followed by Fontys Gogescholen and Hogeschool Rotterdam with 44853, 43524, 38568 studens respectively. Based on the number of study programmes that each university offers, Hogeschool van Amsterdam is only on the 9th place. Fontys Hogescholen is on the first place with 107 study programmes.

In [45]:
full.groupby('university').agg({'study_programme' : ['count'], 'Total' : ['min']})\
.sort_values(('Total', 'min'), ascending = False).head()
Out[45]:
study_programme Total
count min
university
Hogeschool van Amsterdam 56 44853.0
Fontys Hogescholen 107 43524.0
Hogeschool Rotterdam 63 38568.0
Hogeschool van Arnhem en Nijmegen 73 35076.0
Hogeschool Utrecht 64 34211.0
In [46]:
university_size = full.groupby('university')[['Total']].min().sort_values('Total', ascending = False)
university_size = university_size[university_size['Total'] != 0]

ax = university_size.plot(kind = 'bar', color = '#6ec1e4')
In [47]:
university_size = full.groupby('university')[['study_programme']].count().sort_values('study_programme', ascending = False)
university_size = university_size[university_size['study_programme'] > 5]

ax = university_size.plot(kind = 'bar', color = '#6ec1e4')

Out of the 1210 HBO study programmes, 1055 are full time, 563 are part time and 167 are dual

In [48]:
by_type = full[['full_time', 'part_time', 'dual']].sum()

ax = by_type.plot(kind = 'bar', color = '#6ec1e4')

for p in ax.patches: 
    ax.annotate(np.round(p.get_height()), (p.get_x()+p.get_width()/2., p.get_height()), \
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

Out of the 1210 HBO study programmes, 1148 are offered in Dutch, 126 are offered in English and 9 are offered in German.

In [49]:
by_language = full[['dutch', 'english', 'german']].sum()

ax = by_language.plot(kind = 'bar', color = '#6ec1e4')

for p in ax.patches: 
    ax.annotate(np.round(p.get_height()), (p.get_x()+p.get_width()/2., p.get_height()), \
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

Out of 1210 study programmes, there are 1022 with non-missing score values. The mean score is 6.8. The study programme with the highest score of 8.3 is Dans offered, by Hanzehogeschool Groningen. The study programme with the lowest score of 5.8 is Toegepaste Psychologie, offered by HBO Da Vinci Drechtsteden.

In [50]:
bb = full[['study_programme', 'score']]
ax = sns.boxplot(x="score", data=full, color = '#6ec1e4',)
In [51]:
full.sort_values('score', ascending = False).head()
Out[51]:
study_programme university score branch city degree full_time part_time dual dutch english german province Male Female Total
950 Dans Hanzehogeschool Groningen 8.30 Dans Amsterdam HBO bachelor 1 0 0 1 0 0 Noord-Holland 14594.0 14744.0 29338.0
959 Docent Dans Hanzehogeschool Groningen 8.13 Dans Amsterdam HBO bachelor 1 0 0 1 0 0 Noord-Holland 14594.0 14744.0 29338.0
433 Management in de Zorg Hanzehogeschool Groningen 8.10 Management in de Zorg Groningen HBO bachelor 1 1 1 1 0 0 Groningen 14594.0 14744.0 29338.0
957 Docent Dans Hanzehogeschool Groningen 8.08 Dans Groningen HBO bachelor 1 0 0 1 0 0 Groningen 14594.0 14744.0 29338.0
956 Docent Dans Amsterdamse Hogeschool voor de Kunsten 7.92 Dans Amsterdam HBO bachelor 1 0 0 1 0 0 Noord-Holland 1339.0 1681.0 3020.0
In [52]:
full.sort_values('score').head()
Out[52]:
study_programme university score branch city degree full_time part_time dual dutch english german province Male Female Total
147 Toegepaste Psychologie HBO Da Vinci Drechtsteden 5.83 Toegepaste Psychologie Dordrecht HBO bachelor 1 1 0 1 0 0 Zuid-Holland NaN NaN 0.0
181 Sociaal-Juridische Dienstverlening NTI 5.99 Sociaal Juridische Dienstverlening Diverse locaties HBO bachelor 0 1 0 1 0 0 nan NaN NaN 0.0
1112 Communicatie Hogeschool Inholland 6.04 Communicatie Amsterdam HBO bachelor 1 0 0 1 0 0 Noord-Holland 10209.0 14431.0 24640.0
713 Lerarenopleiding 2e graad Engels Hogeschool Inholland 6.04 Lerarenopleiding talen Amsterdam HBO bachelor 1 1 0 1 0 0 Noord-Holland 10209.0 14431.0 24640.0
234 SPD Bedrijfsadministratie NTI 6.13 Accountancy en fiscaal Diverse locaties HBO bachelor 0 1 0 1 0 0 nan NaN NaN 0.0
In [53]:
by_branch = full.groupby('branch').agg({'score' : ['mean', 'count'], 'study_programme' :'count'}).sort_values([('score', 'mean')], ascending = False)
by_branch.head(15)
Out[53]:
score study_programme
mean count count
branch
Dans 7.564167 12 14
Hotelschool 7.303333 9 12
Toneel 7.170000 11 11
Mondzorgkunde 7.155000 4 5
Management in de Zorg 7.150000 5 8
Technische natuurkunde 7.070000 3 3
Muziek 7.061500 20 23
Vastgoed en Makelaardij 7.057500 4 5
Elektrotechniek 7.056923 13 14
Toegepaste Gerontologie 7.055000 2 2
Financial Services Management 6.998571 7 13
Zee- en Luchtvaart 6.990000 5 5
Logopedie 6.988750 8 9
Toerisme en Vrijetijdskunde 6.970000 17 20
Bedrijfskunde en management 6.969677 31 53
In [54]:
by_branch = full.groupby(['study_programme', 'branch']).agg({'score' : ['mean', 'count'], 'study_programme' :'count'}).sort_values([('score', 'mean')], ascending = False)
by_branch.head(15)
Out[54]:
score study_programme
mean count count
study_programme branch
Marketing Communicatie Communicatie 7.780000 1 1
Hogere Managementopleiding voor de Mobiliteitsbranche Bedrijfskunde en management 7.700000 1 1
Bedrijfskunde Bedrijfskunde en management 7.600000 1 8
Docent Dans Dans 7.593333 6 7
International Business Management International Business and Management 7.590000 5 5
Hotel- en Eventmanagement Hotelschool 7.570000 5 5
Hospitality & Events Management Toerisme en Vrijetijdskunde 7.540000 1 3
Dans Dans 7.535000 6 7
BBA Bedrijfskunde Bedrijfskunde en management 7.465000 2 3
Informatie en Communicatie Technologie Informatica 7.450000 1 1
Geo Media & Design Ruimtelijke ordening en planologie 7.360000 1 1
Global Project and Change Management Bedrijfskunde en management 7.330000 1 1
Bio-informatica Informatica 7.266667 3 3
Internationaal Toeristisch Management Toerisme en Vrijetijdskunde 7.242000 5 5
Lerarenopleiding 2e graad Aardrijkskunde Lerarenopleidingen maatschappijvakken 7.231250 8 11

Location with Folium

Last, it is interesting to see the location of the study programmes. Python’s Folium package helps to visualize spatial data. For this particular visualization, I needed to obtain the latitude and the longitude of every city in the dataframe. I used Tableau for getting this information. The size of the circles on the map corresponds to the number of study programmes offered in each city.

In [55]:
geo = pd.read_csv('Lon_Lat.csv')
geo.head()
Out[55]:
City Province Count of Study Programme Distinct count of University Latitude (generated) Longitude (generated)
0 Zwijndrecht Zuid-Holland 1 1 51,817 4,633
1 Zoetermeer Zuid-Holland 1 1 52,057 4,493
2 Rotterdam Zuid-Holland 93 10 51,922 4,479
3 Leiden Zuid-Holland 24 3 52,158 4,493
4 Gouda Zuid-Holland 9 2 52,016 4,708
In [56]:
geo['Count of Study Programme'] = geo['Count of Study Programme'].astype('float')
geo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 6 columns):
City                            53 non-null object
Province                        52 non-null object
Count of Study Programme        53 non-null float64
Distinct count of University    53 non-null int64
Latitude (generated)            45 non-null object
Longitude (generated)           45 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 2.6+ KB
In [57]:
geo['Latitude (generated)'] = geo['Latitude (generated)'].str.replace(',', '.')
geo['Latitude (generated)'] = geo['Latitude (generated)'].astype(float)
In [58]:
geo['Longitude (generated)'] = geo['Longitude (generated)'].str.replace(',', '.')
geo['Longitude (generated)'] = geo['Longitude (generated)'].astype(float)
In [59]:
geo.dropna(inplace = True)
In [62]:
location = folium.Map([geo['Latitude (generated)'].mean(),geo['Longitude (generated)'].mean()], tiles='Mapbox Bright', zoom_start=7)
 
for i in range(0,len(geo)):
   folium.Circle(
      location=[geo.iloc[i]['Latitude (generated)'], geo.iloc[i]['Longitude (generated)']],
      popup=geo.iloc[i]['City'],
      radius=geo.iloc[i]['Count of Study Programme']*100,
      color='crimson',
      fill=True,
      fill_color='crimson'
   ).add_to(location)
 
location.save('mymap.html')
In [63]:
location
Out[63]:
In [ ]: