
Automated Web Scraper With Python AutoScraper [Guide]


Roberta Aukstikalnyte

2022-09-28 · 5 min read

If you’re looking for a way to get public web data scraped automatically at set intervals, you’ve come to the right place. This tutorial will show you how to automate your web scraping processes using AutoScraper – one of the several Python web scraping libraries available.

Before getting started, you may want to check out this in-depth guide for building an automated web scraper using various web scraping tools supported by Python.

Now, let’s get into it.

Automated web scraping with Python AutoScraper library

AutoScraper is a web scraping library written in Python3; it’s known for being lightweight, intelligent, and easy to use – even beginners can use it without an in-depth understanding of web scraping.

AutoScraper accepts a URL or the HTML of any website and learns scraping rules from the sample data you provide. In other words, it matches the sample against the web page and scrapes any data that follows the same rules.

Methods to install AutoScraper

First things first, let’s install the AutoScraper library. There are actually several ways to install and use this library, but for this tutorial, we’re going to use the Python package index (PyPI) repository using the following pip command:

pip install autoscraper
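Alternatively, if you prefer the latest development version, the library can also be installed straight from its GitHub repository (assuming pip and git are available on your system):

pip install git+https://github.com/alirezamika/autoscraper.git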

Scraping Books to Scrape with AutoScraper

This section showcases an example of automatically scraping public data with the AutoScraper module in Python, using the Books to Scrape website as a subject.

The subject website has almost a thousand books in different categories. As the screenshot shows, the links to all the book category pages are available in the left section of the page:

Scraping books category URLs

Now, if you want to scrape the links to all the category pages, you can do so with the following code:

from autoscraper import AutoScraper

# Target page and a single sample of the data we want (one category page link)
UrlToScrape = "https://books.toscrape.com"
WantedList = ["https://books.toscrape.com/catalogue/category/books/travel_2/index.html"]

Scraper = AutoScraper()
# build() learns the scraping rules from the sample and returns the matches
ScrapedData = Scraper.build(UrlToScrape, wanted_list=WantedList)
print(ScrapedData)

In the code above, we first import AutoScraper from the autoscraper library. Then, we provide the URL we want to scrape the information from in UrlToScrape.

The WantedList is assigned sample data that we want to scrape from the given subject URL. To get all the category page links from the target page, we need to give only one example data element to the WantedList. Therefore, we only provide a single link to the Travel category page as a sample data element.

AutoScraper() creates an AutoScraper object that exposes the different functions of the autoscraper library. The Scraper.build() method scrapes data similar to the wanted_list from the target URL.

After executing the Python script above, the ScrapedData list will have all the category page links available at https://books.toscrape.com. The output of the script should look something like this: 

['https://books.toscrape.com/catalogue/category/books/travel_2/index.html', 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html', 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html', 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html', ...]

Scraping book information from a single webpage

So far, we’ve looked at extracting URLs from a page, but we still need to learn how to scrape specific pieces of data. This section discusses using AutoScraper to scrape the details of a single book from its page.

Say that we want to get the title of the book along with its price; we can train and build an AutoScraper model as follows:

from autoscraper import AutoScraper

# Sample book page plus the exact title and price shown on it
UrlToScrap = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList = ["It's Only the Himalayas", "£45.17"]

InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrap, wanted_list=WantedList)

The script above feeds a URL of the book page and a sample of required information from that page to the AutoScraper model. The build() method learns the rules to scrape the information and prepares our InfoScraper for future use.

Now, let’s apply this InfoScraper tactic to a different book’s URL and see if it returns the desired information.

another_book_url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'

scraped_data = InfoScraper.get_result_similar(another_book_url)
print(scraped_data)

Output:

['Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond', 'ce60436f52c5ee68', 'Books', '£49.43', '£0.00', 'In stock (15 available)', '0']

The script above applies InfoScraper to another_book_url and prints the scraped_data. Notice that the scraped data has some unnecessary information along with the desired information. This is due to the get_result_similar() method, which returns information similar to the wanted_list.

another_book_url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'

scraped_data = InfoScraper.get_result_exact(another_book_url)
print(scraped_data)

Output:

['Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond', '£49.43']

Here, we used the get_result_exact() method to retrieve the book title and price accurately and in the exact order defined by wanted_list.

Scraping all the books in a specific category

Until now, we’ve learned to extract similar and exact information from a specific webpage, including URLs. Let’s learn how to scrape data from all the books in one specific category. This can be done using two scrapers: one for scraping the URLs of all the books in the category and another for scraping information from each link.

Let’s turn this strategy into action using the following Python script:

# BooksByCategoryScraper.py
from autoscraper import AutoScraper
import pandas as pd

# BooksUrlScraper section: learns to collect book page links from a category page
TravelCategoryLink = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'
WantedList = ["https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"]
BooksUrlScraper = AutoScraper()
BooksUrlScraper.build(TravelCategoryLink, wanted_list=WantedList)

# BookInfoScraper section: learns to extract the title and price from a book page
BookPageUrl = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList = ["It's Only the Himalayas", "£45.17"]

BookInfoScraper = AutoScraper()
BookInfoScraper.build(BookPageUrl, wanted_list=WantedList)

# Scrape the info of each book and store it in an Excel file
BooksUrlList = BooksUrlScraper.get_result_similar(TravelCategoryLink)
BooksInfoList = []
for Url in BooksUrlList:
    book_info = BookInfoScraper.get_result_exact(Url)
    BooksInfoList.append(book_info)
df = pd.DataFrame(BooksInfoList, columns=["Book Title", "Price"])
df.to_excel("BooksInTravelCategory.xlsx")

The script above has three main parts: two sections for building the scrapers and a third that scrapes data from all the books in the Travel category and saves it as an Excel file.

For this step, we’ve built BooksUrlScraper to scrape all the similar book links on the Travel category page. These eleven links are stored in BooksUrlList. Then, for each URL in BooksUrlList, we apply BookInfoScraper and append the scraped information to BooksInfoList. Finally, BooksInfoList is converted to a data frame and exported as an Excel file for future use.

The output is an Excel file whose contents reflect the initial goal – the titles and prices of all eleven books in the Travel category.

Now we know how to combine multiple AutoScraper models to scrape data in bulk. You can reformulate the script above to scrape all the books from all the categories and save a separate Excel file for each category, as sketched below.
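As a rough illustration, here’s one way that reformulation could look. This is a minimal sketch, not a definitive implementation: it assumes the Scraper, BooksUrlScraper, and BookInfoScraper objects built earlier in this tutorial, and it only covers the first page of each category (some categories span several pages).

# AllCategoriesScraper.py – a sketch reusing the three scrapers built above
import pandas as pd

# Collect all category page links from the home page
CategoryLinks = Scraper.get_result_similar("https://books.toscrape.com")

for CategoryLink in CategoryLinks:
    # Collect the book page links in this category, then scrape each book
    BooksUrlList = BooksUrlScraper.get_result_similar(CategoryLink)
    BooksInfoList = [BookInfoScraper.get_result_exact(Url) for Url in BooksUrlList]
    # Derive a file name such as "travel_2" from the category URL
    CategoryName = CategoryLink.split("/")[-2]
    df = pd.DataFrame(BooksInfoList, columns=["Book Title", "Price"])
    df.to_excel(f"BooksIn_{CategoryName}.xlsx")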

How to use AutoScraper with proxies

Proxies are an integral part of the web scraping process: acquiring data without using them involves various risks, such as the target website blocking your IP address. Let’s take a look at the process of using proxies with AutoScraper.

The build() method of AutoScraper accepts request-related arguments through its request_args parameter.

Here’s what using AutoScraper with proxy IPs looks like in practice:

from autoscraper import AutoScraper

UrlToScrap = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList = ["It's Only the Himalayas", "£45.17"]

# Replace the placeholders with your proxy server endpoints
proxy = {
    "http": 'PROXY_ENDPOINT_HERE',
    "https": 'PROXY_ENDPOINT_HERE',
}
InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrap, wanted_list=WantedList, request_args={"proxies": proxy})

Here, PROXY_ENDPOINT_HERE is a placeholder for the address of a proxy server in the correct format (e.g., http://127.0.0.1:8081). The script above should work fine once proper proxy endpoints are added to the proxy dictionary.
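Note that build() isn’t the only method that sends HTTP requests. The request_args dictionary is forwarded to the underlying requests library, so – assuming the InfoScraper built above – the same dictionary (optionally with extra options such as a timeout) can also be passed to the result methods; a brief sketch:

# Hedged example: reuse the proxy settings for a follow-up request
request_args = {
    "proxies": proxy,
    "timeout": 30,  # assumption: a 30-second timeout for the request
}
scraped_data = InfoScraper.get_result_exact(
    "https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html",
    request_args=request_args,
)
print(scraped_data)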

Saving and loading an AutoScraper model

AutoScraper provides the ability to save and load a pre-trained scraper. We can use the following script to save the InfoScraper object to a file:

InfoScraper.save('file_name')

Similarly, we can load a scraper using:

SavedScraper = AutoScraper()
SavedScraper.load('file_name')
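The loaded model carries the learned rules, so it can be applied straight away. For instance, assuming the InfoScraper model above was saved under 'file_name':

# The loaded scraper behaves exactly like the original InfoScraper
print(SavedScraper.get_result_exact('https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'))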

Now that we’ve built the automated web scraper, let’s move on to the last portion of the tutorial – managing automation mechanisms.

Alternative options for web scraping automation

This section will discuss the alternatives for scheduling Python scripts on macOS, Unix/Linux, and Windows operating systems. 

Say you want your scraper to periodically visit the Travel category page and scrape every new book uploaded – this can be done by scheduling the BooksByCategoryScraper.py script. This script, whenever executed, scrapes data from all the books on the Travel category page and returns it in an Excel file.

You can schedule a Python script through: 

  • Schedule module in Python (see the sketch after this list): tutorial

  • Adding it to the crontab (cron table): tutorial

  • Creating a daemon or background service through systemd: tutorial

  • Task Scheduler in Windows: tutorial
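As a quick illustration of the first option, here’s a minimal sketch using the schedule module (assuming BooksByCategoryScraper.py sits in the same directory and schedule is installed via pip install schedule):

# scheduler.py – runs the scraping script once a day at 09:00
import subprocess
import time

import schedule

def run_scraper():
    # Launch the scraping script as a separate process
    subprocess.run(["python", "BooksByCategoryScraper.py"], check=True)

schedule.every().day.at("09:00").do(run_scraper)

while True:
    schedule.run_pending()
    time.sleep(60)  # check the schedule once a minute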

The crontab and systemd (system daemon) methods are specific to Unix-based operating systems, including Linux and macOS. Meanwhile, the Task Scheduler helps schedule a Python script on Windows.

Frequently Asked Questions

What is the difference between a web crawler and a web scraper?

Simply put, a web scraper is a tool for extracting data from one or more websites, while a web crawler finds or discovers URLs or links on the web.

Can you manually edit or remove rules for AutoScraper objects?

Definitely – you can use the keep_rules() and remove_rules() methods to keep the required rules and remove the unwanted ones, respectively. More details can be found here.
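For instance, you can inspect which learned rule produced which results by passing grouped=True to get_result_similar(), and then prune rules accordingly. A hedged sketch, assuming the InfoScraper and UrlToScrap from the proxy example above (the rule IDs below are placeholders – yours will differ):

# Group results by the rule that produced them to reveal the rule IDs
grouped = InfoScraper.get_result_similar(UrlToScrap, grouped=True)
print(grouped)  # e.g. {'rule_io6s': [...], 'rule_pgh3': [...]}

# Keep only the rules that return the data you want...
InfoScraper.keep_rules(['rule_io6s'])
# ...or drop specific unwanted rules instead
# InfoScraper.remove_rules(['rule_pgh3'])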

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
