Introduction

Web scraping, often called web crawling or web spidering, is the practice of programmatically going over a collection of web pages and extracting data, and it’s a powerful tool for working with data on the web. With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, retrieve data from a site without an official API, or just satisfy your own personal curiosity.

In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use Brickset, a community-run site that contains information about LEGO sets. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages on Brickset and extracts data about LEGO sets from each page, displaying the data to your screen. The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.

Prerequisites

To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.

Step 1 — Creating a Basic Scraper

Scraping is a two-step process:

1. You systematically find and download web pages.
2. You take those web pages and extract information from them.
Both of those steps can be implemented in a number of ways in many languages. You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.

You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.

Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time. It makes scraping a quick and fun process!

Scrapy, like most Python packages, is on PyPI, the Python Package Index. If you have a Python installation like the one outlined in the prerequisites for this tutorial, you already have pip, the tool for installing packages from PyPI, on your machine, so you can install Scrapy like this:
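```
pip install scrapy
```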
If you run into any issues with the installation, or if you want to install Scrapy without using pip, check out the official installation docs.

With Scrapy installed, let’s create a new folder for our project. You can do this in the terminal by running the following command (the folder name is up to you; we’ll use brickset-scraper here):
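```
mkdir brickset-scraper
```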
Now, navigate into the new directory you just created:
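```
cd brickset-scraper
```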
Then create a new Python file for our scraper called scraper.py. We’ll place all of our code in this file for this tutorial. You can create the file in the terminal with the touch command:
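```
touch scraper.py
```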
Or you can create the file using your text editor or graphical file manager.

We’ll start by making a very basic scraper that uses Scrapy as its foundation. To do that, we’ll create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:

- name — just a name for the spider.
- start_urls — a list of URLs that the spider starts crawling from. We’ll start with one URL.
Open the scraper.py file in your text editor and add the code that defines the spider:
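Here’s a minimal version; the spider name brickset_spider is just a label we’ve chosen:

```python
import scrapy


class BrickSetSpider(scrapy.Spider):
    # An identifying name for the spider, used when running it.
    name = "brickset_spider"

    # The list of URLs the spider will start crawling from.
    start_urls = ['http://brickset.com/sets/year-2016']
```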
Let’s break this down line by line:

First, we import scrapy so that we can use the classes the package provides.

Next, we take the Spider class provided by Scrapy and make a subclass out of it called BrickSetSpider. The Spider class has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.

Then we give the spider the name brickset_spider.

Finally, we give our scraper a single URL to start from: http://brickset.com/sets/year-2016. If you open that URL in your browser, it will take you to a search results page, showing the first of many pages containing LEGO sets.

Now let’s test out the scraper. You typically run Python files with a command like python path/to/file.py. However, Scrapy comes with its own command-line interface to streamline the process of starting a scraper. Start your scraper with the following command:
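```
scrapy runspider scraper.py
```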
You’ll see a good deal of log output. That’s a lot of output, so let’s break it down:

- The scraper initialized and loaded the additional components and extensions it needed to handle reading data from URLs.
- It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would.
- It passed that HTML to the parse method, which doesn’t do anything by default. Since we never wrote our own parse method, the spider just finished without doing any work.
Now let’s pull some data from the page.

Step 2 — Extracting Data from a Page

We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.

If you look at the page we want to scrape, you’ll see it has the following structure:

- There’s a header that’s present on every page.
- There’s some top-level search data, including the number of matches, what we’re searching for, and the site’s breadcrumbs.
- Then there’s the data itself: the sets, listed one after another.
When writing a scraper, it’s a good idea to look at the source of the HTML file and familiarize yourself with its structure.
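Brickset’s exact markup may have changed since this tutorial was written, but a simplified sketch of the page, consistent with the selectors we’ll write below, looks something like this:

```html
<body>
  <section class="setlist">
    <article class="set">
      <!-- data for one LEGO set -->
    </article>
    <article class="set">
      <!-- data for the next LEGO set -->
    </article>
    <!-- ...more sets... -->
  </section>
</body>
```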
Scraping this page is a two-step process:

1. First, grab each LEGO set by looking for the parts of the page that have the data we want.
2. Then, for each set, grab the data we want from it by pulling it out of the HTML tags.
Scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page, so we can then work with the data within the element. Scrapy supports both CSS selectors and XPath selectors.

We’ll use CSS selectors for now, since CSS is the easier option and a perfect fit for finding all the sets on the page. If you look at the HTML for the page, you’ll see that each set is specified with the class set. Since we’re looking for a class, we’d use .set for our CSS selector. All we have to do is pass that selector into the response object, like this:

scraper.py
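```python
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        # Match every element on the page with the class "set".
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            pass
```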
This code grabs all the sets on the page and loops over them so we can extract their data. Now let’s pull the data out of each set so we can display it.

Another look at the source of the page we’re parsing tells us that the name of each set is stored within an h1 tag for each set.
The brickset object we’re looping over has its own css method, so we can pass in a selector to locate child elements within it. Modify your code as follows to locate the name of each set and display it:

scraper.py
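```python
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            # ::text fetches the text inside the matched tag rather
            # than the tag itself.
            NAME_SELECTOR = 'h1 ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }
```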
Note: The trailing comma after extract_first() isn’t a typo. We’re going to add more to this section soon, so we’ve left the comma there to make adding to this section easier later.

You’ll notice two things going on in this code:

- We append ::text to our selector for the name. That’s a CSS pseudo-selector that fetches the text inside of the tag rather than the tag itself.
- We call extract_first() on the object returned by brickset.css(NAME_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.
Save the file and run the scraper again:
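```
scrapy runspider scraper.py
```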
This time you’ll see the names of the sets appear in the output.
Let’s keep expanding on this by adding new selectors for images, pieces, and miniature figures, or minifigs, that come with each set. Take another look at the HTML for a specific set.
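Again, the live markup may differ today, but a simplified sketch of a single set entry, consistent with the selectors we’re about to write, looks something like this:

```html
<article class="set">
  <a href="..."><img src="..." /></a>
  <div class="meta">
    <h1><a href="...">Set Name</a></h1>
    <dl>
      <dt>Pieces</dt>
      <dd><a class="plain" href="...">1234</a></dd>
      <dt>Minifigs</dt>
      <dd><a class="plain" href="...">5</a></dd>
    </dl>
  </div>
</article>
```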
We can see a few things by examining this code:

- The image for the set is stored in the src attribute of an img tag inside an a tag at the start of the set. We can use another CSS selector to fetch this value, just like we did when we grabbed the name of each set.
- Getting the number of pieces is a little trickier. There’s a dt tag that contains the text Pieces, followed by a dd tag that contains the actual number of pieces. We’ll use XPath, a query language for traversing XML, to grab this, because it’s too complex to be represented using CSS selectors.
- The number of minifigs in a set is stored similarly: a dt tag containing the text Minifigs, followed by a dd tag with the number.
So, let’s modify the scraper to get this new information: scraper.py
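Here’s a sketch of the updated spider. The XPath selectors assume the dt/dd structure sketched above, so you may need to adjust them against Brickset’s current markup:

```python
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 ::text'
            # Find the dd that follows the dt with the matching label.
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd/a/text()'
            # Grab the src attribute of the set's image.
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }
```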
Save your changes and run the scraper again:
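```
scrapy runspider scraper.py
```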
Now you’ll see that new data in the program’s output.
Now let’s turn this scraper into a spider that follows links.

Step 3 — Crawling Multiple Pages

We’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.

You’ll notice that the top and bottom of each page has a little right carat (>) that links to the next page of results.
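The markup for that link may have changed since this tutorial was written, but it looked something like this:

```html
<li class="next">
  <a href="http://brickset.com/sets/year-2016/page-2">&#8250;</a>
</li>
```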
As you can see, there’s a li tag with the class of next, and inside that tag, there’s an a tag with a link to the next page. All we have to do is tell the scraper to follow that link if it exists. Modify your code as follows:

scraper.py
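Add the following to the end of the parse method, after the for loop that yields each set’s data:

```python
# Look for a link to the next page of results; if one exists,
# queue it up to be crawled and parsed with this same method.
NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
if next_page:
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse,
    )
```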
First, we define a selector for the “next page” link, extract the first match, and check if it exists. The scrapy.Request is a value that we yield saying “Hey, crawl this page,” and callback=self.parse says “once you’ve gotten the HTML from this page, pass it back to this method so we can parse it, extract the data, and find the next page.”

This means that once we go to the next page, we’ll look for a link to the next page there, and on that page we’ll look for a link to the next page, and so on, until we don’t find a link for the next page. This is the key piece of web scraping: finding and following links. In this example, it’s very linear; one page has a link to the next page until we’ve hit the last page. But you could follow links to tags, or other search results, or any other URL you’d like.

Now, if you save your code and run the spider again, you’ll see that it doesn’t just stop once it iterates through the first page of sets. It keeps going through all 779 matches on 23 pages! In the grand scheme of things it’s not a huge chunk of data, but now you know the process by which you can automatically find new pages to scrape.

Here’s our completed code for this tutorial:

scraper.py
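```python
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
            )
```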
Conclusion

In this tutorial you built a fully functional spider that extracts data from web pages in less than thirty lines of code. That’s a great start, but there are a lot of fun things you can do with this spider. Here are some ways you could expand the code you’ve written; they’ll give you some practice scraping data:

- Right now we’re only parsing results from 2016, as you might have guessed from the year-2016 part of the start URL. How would you crawl results from other years?
- Each set includes more data than we extracted here, such as its retail price. How would you pull out additional fields, following the same patterns we used for pieces and minifigs?
That should be enough to get you thinking and experimenting. If you need more information on Scrapy, check out Scrapy’s official docs. For more information on working with data from the web, see our tutorial on “How To Scrape Web Pages with Beautiful Soup and Python 3”.