Web scraping with a Scrapy spider

Web scraping is a useful way to collect the data needed for an analysis. It has many business applications, such as monitoring brand mentions or gathering data for competitor analysis.

Kiesjestudie.nl is a website that aggregates information about all the HBO study programmes in the Netherlands. In this project, I will create a simple spider with Python’s Scrapy library that crawls through the website, collects the name, city, score and language of every study programme, and saves the results in a CSV file.

The crawler starts on a page that lists the names of all study programmes together with links for further information.
Each study programme page already provides a lot of useful information: the different universities that offer the same programme, their locations, and the programme's ratings.
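
Before building the spider, the CSS selectors can be prototyped interactively. Below is a minimal sketch using requests and parsel (the standalone selector library that Scrapy uses internally); both packages are assumptions here and are not needed for the spider itself.

In [ ]:
import requests
from parsel import Selector

# fetch the overview page and preview the links the spider will follow
html = requests.get('http://www.kiesjestudie.nl/allehboopleidingen.html').text
sel = Selector(text=html)
print(sel.css('li > a::attr(href)').extract()[:5])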

Step 1

The first step is to import the necessary libraries (scrapy and csv), create a class for the Scrapy spider, define the start URL, and create an empty CSV file for the output.

In [ ]:
import scrapy, csv
from scrapy.crawler import CrawlerProcess

class StudiesSpider(scrapy.Spider):
    name = "studyprogrammes"
    start_urls = ['http://www.kiesjestudie.nl/allehboopleidingen.html']
    output = "study_programmes_netherlands.csv"

    def __init__(self):
        super().__init__()
        # truncate the output file so every run starts fresh
        open(self.output, "w").close()
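
If a header row is preferred in the output file, it could be written at this point instead of just truncating the file; a small optional sketch (if used, the header=None argument in the pandas step at the end should be dropped):

In [ ]:
    def __init__(self):
        super().__init__()
        # variant: start the file with a header row instead of leaving it empty
        with open(self.output, "w", newline="") as f:
            csv.writer(f).writerow(
                ["Name", "Level", "University", "Score", "Language", "Branch", "City"])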
 

Step 2

As a second step, two functions are defined to collect all the URLs that the spider needs to crawl. The first function collects the URLs of all the different study programmes and follows each of them with the second function as callback, which in turn collects the URLs of all the different universities offering the same programme.

In [ ]:
    def parse(self, response):
        # follow every programme link on the overview page
        links = response.css('li > a::attr(href)').extract()
        for link in links:
            yield response.follow(url=link, callback=self.parse_programmes)

    def parse_programmes(self, response):
        # follow every university page, excluding the lighter cross-links
        programmes = response.css('a.block:not(a.lightlink)::attr(href)').extract()
        for programme in programmes:
            yield response.follow(url=programme, callback=self.parse_desc)
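
The li > a selector on the overview page is fairly broad and could also match navigation links; the variant below adds an illustrative guard that skips off-site URLs (the domain check is an assumption, not part of the original spider). Scrapy's built-in duplicate filter already prevents the same URL from being crawled twice.

In [ ]:
    def parse(self, response):
        # variant with a guard against following external links
        for link in response.css('li > a::attr(href)').extract():
            if link.startswith('http') and 'kiesjestudie.nl' not in link:
                continue
            yield response.follow(url=link, callback=self.parse_programmes)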
 

Step 3

The last function collects the data with CSS and XPath selectors and the extract methods, and appends each record to the CSV file.

In [ ]:
    def parse_desc(self, response):
        with open(self.output, "a", newline="") as f:
            writer = csv.writer(f)
            # programme name, wrapped in literal quotes for the output file
            title = response.css('h2::text').extract_first()
            title = "\"" + title + "\""
            level = response.css('h3::text').extract_first()
            uni = response.css('h4 > a::text').extract_first()
            score = response.css('div.font40 > a::text').extract_first()
            # the language sits in the last cell of the last row of the info table
            language = response.xpath('//table[@class="hiddentable marginleft"]//tr[last()]//td[last()]/text()').extract_first()
            branch = response.css('table.hiddentable a:nth-of-type(1)::text').extract_first()
            # the city is the third link in the info table
            city = response.css('table.hiddentable a::text').extract()[2]
            writer.writerow([title, level, uni, score, language, branch, city])
            yield {'Name': title, 'Level': level, 'University': uni, 'Score': score, 'Language': language, 'Branch': branch, 'City': city}
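
One caveat: extract_first returns None when a selector matches nothing, which would make the string concatenation for the title fail on pages with a different layout. extract_first accepts a default value, so a more defensive variant of the extractions could look like this sketch:

In [ ]:
            # defensive variant: missing elements yield '' instead of None
            title = response.css('h2::text').extract_first(default='')
            score = response.css('div.font40 > a::text').extract_first(default='')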
 

After these steps, the spider is complete:

In [1]:
import scrapy, csv
from scrapy.crawler import CrawlerProcess

class StudiesSpider(scrapy.Spider):
    name = "studyprogrammes"
    start_urls = ['http://www.kiesjestudie.nl/allehboopleidingen.html']
    output = "study_programmes_netherlands.csv"

    def __init__(self):
        super().__init__()
        # truncate the output file so every run starts fresh
        open(self.output, "w").close()

    def parse(self, response):
        # follow every programme link on the overview page
        links = response.css('li > a::attr(href)').extract()
        for link in links:
            yield response.follow(url=link, callback=self.parse_programmes)

    def parse_programmes(self, response):
        # follow every university page, excluding the lighter cross-links
        programmes = response.css('a.block:not(a.lightlink)::attr(href)').extract()
        for programme in programmes:
            yield response.follow(url=programme, callback=self.parse_desc)

    def parse_desc(self, response):
        with open(self.output, "a", newline="") as f:
            writer = csv.writer(f)
            # programme name, wrapped in literal quotes for the output file
            title = response.css('h2::text').extract_first()
            title = "\"" + title + "\""
            level = response.css('h3::text').extract_first()
            uni = response.css('h4 > a::text').extract_first()
            score = response.css('div.font40 > a::text').extract_first()
            # the language sits in the last cell of the last row of the info table
            language = response.xpath('//table[@class="hiddentable marginleft"]//tr[last()]//td[last()]/text()').extract_first()
            branch = response.css('table.hiddentable a:nth-of-type(1)::text').extract_first()
            # the city is the third link in the info table
            city = response.css('table.hiddentable a::text').extract()[2]
            writer.writerow([title, level, uni, score, language, branch, city])
            yield {'Name': title, 'Level': level, 'University': uni, 'Score': score, 'Language': language, 'Branch': branch, 'City': city}
 

Step 4

The only step left is to run the spider. Note that CrawlerProcess can only be started once per Python process, so the notebook kernel needs to be restarted before running the crawl again.

In [ ]:
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',   # identify the crawler as a regular browser
})
c.crawl(StudiesSpider)
c.start()
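
As an alternative to the manual file handling, the dicts yielded by parse_desc could be exported by Scrapy itself; a minimal sketch using the FEEDS setting (available in Scrapy 2.1 and later), which would also add a header row taken from the dict keys:

In [ ]:
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # let Scrapy write the CSV from the yielded items (Scrapy 2.1+)
    'FEEDS': {'study_programmes_netherlands.csv': {'format': 'csv'}},
})
c.crawl(StudiesSpider)
c.start()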
 

The output

With the help of the pandas library, I can easily read in the dataset. I further explore and analyse the obtained data in the data cleaning and exploration project.

In [1]:
import pandas as pd
df = pd.read_csv('study_programmes_netherlands.csv', encoding='latin-1', header=None)  # the file has no header row
In [2]:
df.head()
Out[2]:
   0                                    1                                          2                                                 3     4           5              6
0  “Bio-informatica”                    HBO bachelor (voltijd)                     Hogeschool Leiden                                 6,74  Nederlands  Informatica    Leiden
1  “Bio-informatica”                    HBO bachelor (voltijd)                     Hogeschool van Arnhem en Nijmegen locatie Nijm…   7,43  Nederlands  Informatica    Nijmegen
2  “Bestuurskunde/Overheidsmanagement”  HBO bachelor (voltijd / deeltijd / duaal)  De Haagse Hogeschool                              6,48  Nederlands  Bestuurskunde  Den Haag
3  “Bestuurskunde”                      HBO bachelor (duaal)                       Hogeschool NCOI                                   NaN   Nederlands  Bestuurskunde  Diverse locaties
4  “Bedrijfskundige Informatica”        HBO bachelor (deeltijd)                    LOI Hogeschool                                    NaN   Nederlands  Informatica    Diverse locaties
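
Since the file has no header row, descriptive column names can be assigned after reading, and the comma-decimal scores converted to proper floats; a small follow-up sketch:

In [ ]:
# name the columns and parse the comma-decimal scores as floats
df.columns = ['Name', 'Level', 'University', 'Score', 'Language', 'Branch', 'City']
df['Score'] = df['Score'].str.replace(',', '.').astype(float)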
