Scrapy VS. urllib2, requests, Beautiful Soup and lxml

In this article we will compare Scrapy with the most popular Python web-scraping libraries: urllib2, requests, Beautiful Soup and lxml. They are all excellent libraries with wide adoption and varying degrees of performance and usability, and they are commonly used for web scraping tasks. Let's get this battle started.

Methodology


We will compare a Scrapy spider with a configurable Python 2.7 scraper that can use any of the aforementioned libraries and a configurable number of threads to perform crawls. We will do several runs until we get a clear view of their performance characteristics.

Our benchmark problem


We will reuse the Wikipedia benchmark problem: we will scrape 1000 random Wikipedia pages and extract their URLs and titles.


We use https://en.wikipedia.org/wiki/Special:Random to get redirects to random articles. Both urllib2 and the requests library support redirects seamlessly. Before jumping into coding, let's begin with a few words about them.

urllib2


The urllib2 library provides an API for fetching internet resources identified by URLs. It is designed to be extended by individual applications to support new protocols or add variations to existing protocols (such as handling HTTP basic authentication). Its big advantage is that it's included in the Python standard library so as long as we have Python installed we are good to go. More information can be found here.

Note that we will use Python 2.7 here. If you try to import urllib2 under Python 3 you get: ImportError: No module named 'urllib2'. There you would have to import urllib.request and use urllib.request.urlopen() instead of urllib2.urlopen().
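Purely for reference, a rough Python 3 equivalent of the download helper we will write below could look like this (download_with_urllib is just an illustrative name; the rest of the article sticks to Python 2.7):

from urllib.request import urlopen

def download_with_urllib(url):
    # urlopen() transparently follows redirects; geturl() returns the
    # final URL and read() the raw body of the page.
    r = urlopen(url)
    return r.geturl(), r.read()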

The requests Python library


Similarly to urllib2, requests is a module that allows us to fetch URLs. It is getting more and more popular because it's very easy to use and has a quite impressive set of features:
  • Supports international Domains and URLs
  • Supports Keep-Alive and uses connection pooling
  • Supports sessions with cookie persistence out of the box
  • Provides browser-style SSL verification
  • Provides basic/digest authentication
  • Automatically decompresses responses
  • Supports Unicode response bodies
  • Allows multipart File Uploads
  • Supports connection timeouts
  • Has .netrc support
  • It is thread-safe
  • ... many more
Additionally, the official documentation is awesome!
The requests library has to be installed using the instructions provided here. It supports Python versions 2.6 - 3.4.
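As a quick taste, here is a minimal sketch of typical requests usage - a one-off GET with a connection timeout and a Session that reuses connections (Keep-Alive) and persists cookies; the URL is just an example:

import requests

# One-off request with an explicit timeout (in seconds).
r = requests.get("https://en.wikipedia.org/wiki/Special:Random", timeout=10)
page_body = r.content      # raw bytes of the downloaded page
status = r.status_code     # e.g. 200

# A Session reuses the underlying connection and persists cookies
# across the requests made through it.
session = requests.Session()
session.get("https://en.wikipedia.org/wiki/Special:Random", timeout=10)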

Beautiful Soup


Beautiful Soup is a module for parsing HTML documents. It's quite robust and handles malformed markup nicely. It's named after the expression "tag soup", which describes severely invalid markup. Beautiful Soup creates a parse tree that can be used to extract data from HTML. The official documentation is comprehensive, easy to read and full of examples.
Beautiful Soup is available for Python 2.6+ and Python 3. It has to be installed explicitly by following the instructions one can find here.

The lxml library


lxml is the most feature-rich Python library for processing XML and HTML. It's also very fast and memory efficient.

Scrapy selectors are built on top of lxml, and Beautiful Soup also supports it as a parser. One might also use lxml.html.soupparser, which makes lxml use Beautiful Soup as a parser back-end in order to handle severely broken markup.
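A minimal sketch of the latter, assuming both lxml and Beautiful Soup are installed (the markup is just an example of badly broken HTML):

from lxml.html import soupparser

# Beautiful Soup does the lenient parsing, lxml provides the element tree API.
tree = soupparser.fromstring("<p>Unclosed <b>tags <i>everywhere")
bold_text = tree.xpath("//b")[0].text_content()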

lxml's official documentation is not that beginner-friendly, we reckon. One might find this a much more beginner-friendly introduction to lxml. A fun fact is that it was written by Kenneth Reitz, the original author of the requests library.

lxml is not installed by default either; one can install it by following the instructions here.

The Scrapy web crawler


A Scrapy spider that crawls Wikipedia is basically a straightforward application of the Scrapy project bootstrapping process:

$ scrapy startproject wikipedia_scraper
$ cd wikipedia_scraper
$ scrapy genspider wikipedia en.wikipedia.org

followed by some slight modifications to the auto-generated spider:

import scrapy
 
class WikipediaSpider(scrapy.Spider):
    name = "wikipedia" # Name of our spider
    allowed_domains = ['en.wikipedia.org']
    # The same Special:Random URL repeated 1000 times; each request
    # gets redirected to a different random article.
    start_urls = (
        'https://en.wikipedia.org/wiki/Special:Random',
    ) * 1000

    def parse(self, response):
        # Extract the article title from the page's <h1> and return it
        # together with the final (post-redirect) URL.
        title = response.xpath('//h1/text()').extract_first()
        return {'title': title, 'url': response.url}

We can run a simple crawl with scrapy crawl wikipedia, but in this case we will be comparing Scrapy against libraries that run using multiple threads, so - in order to make a fair comparison - we need to know how to adjust the amount of concurrency with which we hit Wikipedia. We can do this by adjusting the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings. For example, if we want to run a Scrapy crawl allowing up to 20 concurrent connections we can do:

scrapy crawl wikipedia \
-s CONCURRENT_REQUESTS_PER_DOMAIN=20 \
-s CONCURRENT_REQUESTS=20

Notice that this, due to Scrapy's architecture, doesn't mean 20 threads. Scrapy will still use a single thread to do most of the processing but it will allow up to 20 concurrent connections. It could potentially open thousands of connections - still from a single thread - but this would likely annoy our target website if it's a single one (in our case Wikipedia).
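If we don't want to pass the -s flags on every run, the same values can simply live in the project's auto-generated settings.py; a minimal sketch of the relevant lines:

# settings.py of the wikipedia_scraper project
CONCURRENT_REQUESTS = 20
CONCURRENT_REQUESTS_PER_DOMAIN = 20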

urllib2, requests, Beautiful Soup, lxml Scraper


We will now incorporate all these libraries into a single hand-made crawler. It's useful to clarify right away that our crawling job has two distinct steps: the "download" step, which takes a URL and returns the text of the page's body, and the "parse/extract" step, which takes the output of the previous step, parses it and extracts the title.


The four libraries we are examining aren't all suitable for everything. For example, you can't use urllib2 to parse HTML and extract titles. We can only use urllib2 or requests to download pages and lxml or Beautiful Soup to parse and extract data from pages. Here's how:

Downloading pages


In order to use urllib2 we need the following code:
import urllib2

def download_with_urllib2(url):
    # urlopen() transparently follows redirects; geturl() gives us the
    # final URL and read() the raw body of the page.
    r = urllib2.urlopen(url)
    return r.geturl(), r.read()

We can see that this is a quite straightforward use of the library. In order to do the same with requests we use the following code:

import requests

def download_with_requests(url):
    r = requests.get(url)
    # If we were redirected, recover the final URL from the last
    # redirect response's Location header.
    if r.history and ('Location' in r.history[-1].headers):
        url = r.history[-1].headers['Location']
    return url, r.content

That's quite a bit of code. Please don't judge requests by it: extracting the final URL after possible redirects is a rather unusual operation and it's somewhat involved with requests. If we didn't need to do that, we could get the target page with about as much code as with urllib2: requests.get(url).content. You will have to trust us that requests is very usable in the most common use cases.

Parsing pages and extracting the title


For this part we can use the XPath expression //h1 with lxml, as we can see here:
from lxml import html

def parse_with_lxml(url, page):
    # Build an element tree from the raw HTML and take the text
    # of the first <h1> element.
    tree = html.fromstring(page)
    return url, tree.xpath("//h1")[0].text_content()

The code is a bit hacky but overall quite simple and readable. Beautiful Soup, on the other hand, has an indeed very beautiful API which returns the first matching element simply via .h1:
from bs4 import BeautifulSoup

def parse_with_bs4(url, page):
    soup = BeautifulSoup(page, "html.parser")
    return url, soup.h1.get_text()

Note that we set the parser type explicitly to "html.parser", Python's built-in parser; if we omitted the argument, Beautiful Soup would pick the best parser installed on the system (which could well be lxml here). As we said earlier, Beautiful Soup supports several parsers, including lxml, and this is something to keep in mind while reading the performance results below. With other parsing back-ends the performance might differ, and trying the other parser types is left as an exercise for the reader. A list of them can be found here.
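For instance, switching to the lxml back-end is a one-word change (parse_with_bs4_lxml is just an illustrative name and assumes lxml is installed):

from bs4 import BeautifulSoup

def parse_with_bs4_lxml(url, page):
    # Same Beautiful Soup API; only the parser back-end changes.
    soup = BeautifulSoup(page, "lxml")
    return url, soup.h1.get_text()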

Running a job


With all this tooling in place one can write a generic function run_crawl(crawl, parse, urls) that is able to run a job with every possible combination of libraries:
import logging


logger = logging.getLogger(__name__)
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)


def run_crawl(crawl, parse, urls):

    logger.info("I'm a job using %s and %s running on %d urls" %
                (crawl, parse, len(urls)))

    for url in urls:

        logger.debug("Working on url: %s" % url)

        download_method = (download_with_urllib2
                           if crawl == "urllib2"
                           else download_with_requests)

        parse_method = (parse_with_lxml
                        if parse == "lxml"
                        else parse_with_bs4)

        url, page = download_method(url)

        url, title = parse_method(url, page)

        logger.info("url: %s, title: %s" % (url, title))

For example run_crawl('urllib2', 'lxml', urls) will crawl everything in the urls list using urllib2 and lxml.

One extra feature we would like to have is the ability to run this with multiple threads. We are very lucky that the crawling problem is trivially parallelizable and that our only output is the thread-safe logger.info. This means that we won't need any synchronisation for our multi-threaded version of the crawler; all we need to do is split the URLs into a number of independent jobs. We can do it like this:

import threading

def call_with_threading(crawl, parse, nthreads, urls):
    threads = []

    # Round-up the job length
    job_length = (len(urls)+nthreads-1)/nthreads

    for i in xrange(0, len(urls), job_length):
        urls_part = urls[i:i+job_length]
        t = threading.Thread(target=run_crawl, args=[crawl, parse, urls_part])
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

Finally all we need to do is to call one of the two versions from a main() function:
import sys

def get_urls():
    return ["https://en.wikipedia.org/wiki/Special:Random"
            for i in xrange(1000)]

def main(crawl, parse, nthreads):
    if nthreads < 0:
        sys.exit(1)
    elif nthreads == 0:
        run_crawl(crawl, parse, get_urls())
    else:
        call_with_threading(crawl, parse, nthreads, get_urls())

That's more or less it. By adding a few more lines of argparse code we can use command line arguments to configure crawler features:
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Crawl random Wikipedia pages using different libraries.')

    parser.add_argument('--crawl', choices=["urllib2", "requests"],
                        default="urllib2",
                        help='download method - urllib2 (default) or requests')
    parser.add_argument('--parse', choices=["lxml", "beautifulsoup"],
                        default="lxml",
                        help='parse method - lxml (default) or beautifulsoup')
    parser.add_argument('--nthreads', help='Number of threads (default 0)',
                        type=int, default=0)

    args = parser.parse_args()

    main(**vars(args))

If you copy-paste the code above into a Python script, e.g. scrape_wikipedia.py, you will be able to run a few different experiments. Here are some examples:
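These invocations simply combine the command-line options we just defined (any flag you omit falls back to its default):

$ python scrape_wikipedia.py
$ python scrape_wikipedia.py --nthreads 20
$ python scrape_wikipedia.py --crawl requests --parse beautifulsoup --nthreads 20
$ python scrape_wikipedia.py --crawl urllib2 --parse lxml --nthreads 50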


Comparison


Here's a quick overview of the comparison between Scrapy and its most distinctive rivals.


Let's delve a bit more into the details...

Performance benchmarking


In order to do a fair comparison, we run the Scrapy crawler and the crawler we just wrote with a number of threads ranging from 1 to 100 and compare the results. Here is the summary:


There is a school of thought claiming that threads - especially when dealing with I/O-heavy tasks - are almost free. There has certainly been lots of work towards extremely efficient thread execution, switching and synchronisation mechanisms, but at the end of the day the devil is in the (implementation) details. In our case, we can see that the difference between urllib2 and requests is minimal in terms of scaling with the number of threads. This is evident from the fact that the lines which differ only in those libraries are very close together. On the other hand, lxml and Beautiful Soup (using the built-in html.parser implementation) have quite different performance profiles, with lxml proving considerably more efficient and scalable. This most likely has to do with the contention those libraries create on synchronised resources, but a detailed investigation is certainly out of scope here.
Scrapy proves to be the most efficient in terms of requests per second, outperforming the best alternative (requests + lxml) by at least 30% (notice the log scale in the plot). This means 30% lower hosting bills if we are talking about a crawling-heavy deployment. Another thing to keep in mind is that Scrapy uses a single thread, which means that you will notice at most 100% CPU usage if you have a look with top. The other libraries use more than one CPU, which also translates to considerably more expensive hosting. With the same CPU horsepower you could run more than one Scrapy instance by using e.g. scrapyd.

Ease of installation


The clear winner here is certainly urllib2, which comes with Python by default. All the other libraries need explicit installation and have dependencies. Admittedly, Scrapy, as a complete web-scraping framework, has more dependencies than the others, and one of its most "difficult" dependencies proves to be lxml(!). Indeed, lxml - in order to guarantee high performance - often needs to be compiled natively, which means pulling in all sorts of compiler dependencies. Nevertheless, if you are willing to go through the (trivial in most cases) effort of installing libraries, use Scrapy to get the extra performance gains. If installing anything is hard (e.g. you are working on machines where you have limited permissions), then you are stuck with urllib2, which is perfectly fine for easy tasks but might get increasingly complex when you need to deal with sessions, cookies, authentication etc.

Development experience


This is hard to judge from the examples we've seen here. All the examples were simple enough, with the exception of the requests one, which is quite unfortunate since requests is one of the most usable libraries. In any case, the most fun to play with is certainly the combination of requests and Beautiful Soup. They are easy to draft a prototype with and their documentation is outstanding and full of examples. If you really want to write just 5 lines of code and have something that works, this is certainly the setup for you.

On the other hand, one shouldn't underestimate the implications of multi-threaded programming, which emerge as soon as we start looking into more performant solutions. As soon as you enter this realm, Scrapy becomes a far better alternative. Additionally, if you have more complex requirements in terms of web-crawling behaviour, you might find that Scrapy implements them by default, while with the other libraries you will have to write extra and often quite complicated code. In other words, always review the features Scrapy provides by checking resources like this before you invest heavily in implementing your own solution.

Memory usage


Memory usage obviously depends on the number of threads. If we pick a point where we get decent performance out of all these libraries and frameworks, i.e. around 20 threads, or 20 concurrent requests in Scrapy's case, we measure Scrapy using 245 MB of memory, requests-based configurations using 315-380 MB and urllib2-based ones using 305-350 MB. It also turns out that Beautiful Soup always requires a bit more memory than lxml. In general, all those numbers are in the same range and shouldn't significantly influence any decision.

Output files and formats


Scrapy has an advantage here. It can output items in several file formats and to back-ends like S3 and FTP without any coding, as shown below. It's a bit unfair to compare individual Python libraries with Scrapy on this front, though, since they are just specialised libraries and they neither claim nor are expected to be full-blown web scraping frameworks.
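For example, dumping our spider's items into a JSON file (the file name here is arbitrary) is a single command-line flag:

$ scrapy crawl wikipedia -o wikipedia_titles.json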

Javascript support


None of those libraries, nor Scrapy, has brilliant support for JavaScript. If you need non-manual handling you will have to use Selenium, as we've previously seen.

Conclusion


Overall, once again, we turn out to be very pro-Scrapy, but we still believe this is a fair and relatively objective comparison. To be perfectly clear: if you want to develop a quick, simple prototype and you don't already know Scrapy, don't bother learning it. Use a single-threaded implementation based on requests and Beautiful Soup and you'll likely be done within minutes.

Requests, urllib2, lxml and Beautiful Soup are all specialised libraries with very focused functionality, and they neither claim nor try to be complete web-scraping solutions. As soon as you need something more efficient, robust and performant, at the very least study Scrapy's feature set thoroughly so you can tell when exactly the right time to transition to it has come.