
How to Scrape Wikipedia Data: Ultimate Tutorial

Roberta Aukstikalnyte

2023-08-25 | 4 min read

Wikipedia is known as one of the biggest online sources of information, covering a wide variety of topics. Naturally, it possesses a ton of valuable information for research or analysis. However, to obtain this information at scale, you’ll need specific tools and knowledge. 

In this article, we’ll answer questions like “Is scraping Wikipedia allowed?” and “What exact information can be extracted?”. In the second portion of the article, we’ll walk through the exact steps for extracting publicly available information from Wikipedia using Python and Oxylabs’ Wikipedia API (part of Web Scraper API). We’ll cover different types of Wikipedia article data, such as paragraphs, links, tables, and images.

Let’s get started. 

1. Connecting to the Web Scraper API

Let's start by creating a Python file for our scraper:

touch main.py
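The scraper will also rely on a few third-party libraries: requests for calling the API, beautifulsoup4 for parsing HTML, and pandas for processing tables (pandas.read_html additionally needs a parser backend such as lxml). If you don’t have them yet, they can be installed with pip:

pip install requests beautifulsoup4 pandas lxml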

Within the created file, we’ll begin assembling a request for the Web Scraper API:

import requests

USERNAME = 'yourUsername'
PASSWORD = 'yourPassword'

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
}

The USERNAME and PASSWORD variables hold our credentials for authenticating with the API, while payload contains the query parameters that the API supports. In our case, we have to specify the source and the url. Source denotes the type of scraper that should be used to process this request, and the universal one will work just fine for us. Url tells the API which link to scrape. For more information about the other available parameters, check out the Web Scraper API documentation.

After specifying the information required for the API, we can form and send the request:

response = requests.request(
   'POST',
   'https://realtime.oxylabs.io/v1/queries',
   auth=(USERNAME, PASSWORD),
   json=payload,
)

print(response.status_code)

If we set everything up correctly, the code should print out 200 as the status.
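If you’d like a quick sanity check, here’s a minimal optional sketch: it prints the API’s error message if something went wrong and, otherwise, previews the raw page HTML, which should be returned inside the result’s content field when no parsing instructions are set:

if response.status_code != 200:
    # Print the error message returned by the API to help with debugging.
    print(response.text)
else:
    # Without parsing instructions, each result's content is the raw page HTML.
    raw_html = response.json()['results'][0]['content']
    print(raw_html[:500])  # Preview the first 500 characters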

2. Extracting specific data

Now that we can send requests to the API, we can start scraping the specific data we need. Without further instructions, the scraper returns a raw lump of HTML. However, we can use the Custom Parser functionality to specify and get exactly what we want. Custom Parser is a free Scraper APIs feature that lets you define your own parsing and data processing logic for the HTML data.

1) Paragraphs

To start off, we can get the most obvious one: paragraphs of text. For that, we will need to find the CSS selector for them. Inspecting the page, we can see that the text sits inside <p> (paragraph) elements.

We can edit our payload to the API like this:

payload = {
   'source': 'universal',
   'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
   'parse': True,
   "parsing_instructions": {
       "paragraph_text": {
           "_fns": [
               {"_fn": "css", "_args": ["p"]},
               {"_fn": "element_text"}
           ]
       }
   }
}

response = requests.request(
  'POST',
  'https://realtime.oxylabs.io/v1/queries',
  auth=(USERNAME, PASSWORD),
  json=payload,
)

print(response.json())

Here, we pass two additional parameters: parse indicates that we want to use a Custom Parser with our request, while parsing_instructions allows us to specify what exactly needs to be parsed. In this case, we add a function that fetches elements by a CSS selector and another one that extracts the text of those elements.

After running the code, we can see the extracted paragraphs in the response under the paragraph_text key.
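For instance, here’s a small optional sketch that prints only the parsed paragraph text instead of the whole JSON response:

data = response.json()
for result in data['results']:
    # The parsed text is available under the paragraph_text key we defined above.
    print(result['content']['paragraph_text'])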

2) Links

To scrape the links, we first need the CSS selector for them. Inspecting the page, we can see that they sit inside <a> (anchor) elements.

To fetch these links now, we have to edit our Custom Parser as follows:

payload = {
   'source': 'universal',
   'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
   'parse': True,
   "parsing_instructions": {
       "links": {
           "_fns": [
               {"_fn": "css", "_args": ["a"]},
           ]
       },
   }
}

response = requests.request(
  'POST',
  'https://realtime.oxylabs.io/v1/queries',
  auth=(USERNAME, PASSWORD),
  json=payload,
)

print(response.json())

If we look at the response, we can see that each link is returned as a full HTML element. Also, some of them are not absolute links, but relative or protocol-relative ones.

...
'<a href="/wiki/Wikipedia:About">About Wikipedia</a>', 
'<a href="/wiki/Wikipedia:General_disclaimer">Disclaimers</a>', 
'<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">Contact Wikipedia</a>', 
...

Let’s write some code that retrieves the links and puts them into one collection:

from bs4 import BeautifulSoup

def extract_link(html):
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.find('a').get('href')
    if link is None:
        return None
    # Protocol-relative links (e.g., //en.wikipedia.org/...) only need a scheme.
    if link.startswith('//'):
        link = 'https:' + link
    # Relative links (e.g., /wiki/...) need the domain prepended as well.
    elif not link.startswith('http'):
        link = 'https://en.wikipedia.org' + link
    return link

raw_scraped_data = response.json()

processed_links = []
for result_set in raw_scraped_data['results']:
    links = list(map(extract_link, result_set['content']['links']))
    processed_links = links

print(processed_links)

First, we define the extract_link function. It uses Beautiful Soup to extract the value of the href attribute from the HTML element. Then, we check whether the link is protocol-relative or relative and prepend the missing scheme or domain accordingly.

Next, we take the raw scraped links from the API response and iterate through the results. This way, we can use the convenient Python function map() to pass each raw HTML element through the function described earlier and get a list of clean links.
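Since a Wikipedia page repeats many of the same anchors across navigation and references, you may also want to deduplicate the result. Here’s a small optional sketch that removes duplicates (and any empty entries) while preserving the original order:

# Optional: drop empty entries and duplicates while preserving order.
unique_links = list(dict.fromkeys(link for link in processed_links if link))
print(unique_links)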

3) Tables

The next bit of information we could gather is from tables. Let's find a CSS selector for them.

The table HTML element is used in multiple ways across the page, but we will limit ourselves to extracting information from the tables that appear within the body of text. As we can see, such tables have the wikitable CSS class, so we can target them with the table.wikitable selector.
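To have the API return these tables, the payload needs a corresponding parsing instruction, the same "table" entry that appears in the full payload in the final section. With the payload we already defined, it can be added like this before re-sending the request:

# Add a parsing instruction that returns every table with the wikitable class.
payload['parsing_instructions']['table'] = {
    '_fns': [
        {'_fn': 'css', '_args': ['table.wikitable']},
    ]
}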

We can begin writing our code:

import pandas

def extract_tables(html):
    list_of_df = pandas.read_html(html)
    tables = []
    for df in list_of_df:
        tables.append(df.to_json(orient='table'))
    return tables

raw_scraped_data = response.json()
processed_tables = []

for result_set in raw_scraped_data['results']:
    tables = list(map(extract_tables, result_set['content']['table']))
    processed_tables = tables

print(processed_tables)

The extract_tables function accepts the HTML of a table element and parses the table into a JSON structure with the help of the Pandas Python library.

Then, as previously done with the links, we iterate over the results, map each table to our extraction function, and get our array of processed information.

Note: a table doesn’t always directly translate into a JSON structure in an organized way, so you might want to customize the information you gather from the table or use another format.
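For example, if a flat CSV file suits your workflow better than JSON, a small variation of extract_tables (just a sketch) could return CSV strings instead:

import pandas

def extract_tables_as_csv(html):
    # Parse the table HTML and convert each resulting DataFrame to CSV text.
    list_of_df = pandas.read_html(html)
    return [df.to_csv(index=False) for df in list_of_df]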

4) Images

The final elements we’ll be retrieving are image sources. Time to find their CSS selectors.

We can see that images sit inside their own HTML element called img. All that’s left is to add a matching parsing instruction to the payload and write the extraction code.
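As with the tables, the payload needs an "images" parsing instruction first (it’s also part of the full payload in the final section), after which the request can be re-sent:

# Add a parsing instruction that returns every img element on the page.
payload['parsing_instructions']['images'] = {
    '_fns': [
        {'_fn': 'css', '_args': ['img']},
    ]
}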

def extract_img(html):
   soup = BeautifulSoup(html,'html.parser')
   img = soup.find('img').get('src')
   return img

raw_scraped_data = response.json()
processed_images = []

for result_set in raw_scraped_data['results']:
   images = list(map(extract_img, result_set['content']['images']))
   processed_images = images

print(processed_images)

We begin by defining the extract_img function, which takes an HTML snippet and gets the image source from it. Then, we iterate through the response from the API and map each img HTML element to our extraction function. This leaves us with a list of processed image sources.
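Keep in mind that image sources on Wikipedia are typically protocol-relative (they start with //), so, much like with the links, you may want to prepend a scheme before using them. A small optional sketch:

# Optional: prepend a scheme to protocol-relative image sources.
processed_images = [
    'https:' + src if src and src.startswith('//') else src
    for src in processed_images
]
print(processed_images)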

3. Joining everything together

Now that we went over extracting a few different pieces of Wikipedia page data, let’s join it all together and add some saving to a file for a finalized version of our scraper:

import requests
import json
from bs4 import BeautifulSoup
import pandas

def extract_link(html):
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.find('a').get('href')
    if link is None:
        return None
    # Protocol-relative links (e.g., //en.wikipedia.org/...) only need a scheme.
    if link.startswith('//'):
        link = 'https:' + link
    # Relative links (e.g., /wiki/...) need the domain prepended as well.
    elif not link.startswith('http'):
        link = 'https://en.wikipedia.org' + link
    return link

def extract_img(html):
    soup = BeautifulSoup(html, 'html.parser')
    img = soup.find('img').get('src')
    return img

def extract_tables(html):
    list_of_df = pandas.read_html(html)
    tables = []
    for df in list_of_df:
        tables.append(df.to_json(orient='table'))
    return tables

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
    'parse': True,
    "parsing_instructions": {
        "paragraph_text": {
            "_fns": [
                {"_fn": "css", "_args": ["p"]},
                {"_fn": "element_text"}
            ]
        },
        "table": {
            "_fns": [
                {"_fn": "css", "_args": ["table.wikitable"]},
            ]
        },
        "links": {
            "_fns": [
                {"_fn": "css", "_args": ["a"]},
            ]
        },
        "images": {
            "_fns": [
                {"_fn": "css", "_args": ["img"]},
            ]
        }
    }
}

# Create and send the request
USERNAME = 'yourUsername'
PASSWORD = 'yourPassword'

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=(USERNAME, PASSWORD),
    json=payload,
)

raw_scraped_data = response.json()
processed_data = {}

for result_set in raw_scraped_data['results']:
    links = list(map(extract_link, result_set['content']['links']))
    processed_data['links'] = links

    tables = list(map(extract_tables, result_set['content']['table']))
    processed_data['tables'] = tables

    images = list(map(extract_img, result_set['content']['images']))
    processed_data['images'] = images

    processed_data['paragraphs'] = result_set['content']['paragraph_text']

json_file_path = 'data.json'

with open(json_file_path, 'w') as json_file:
    json.dump(processed_data, json_file, indent=4)

print(f'Data saved as {json_file_path}')
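That’s it: running the finished script with python main.py sends the request, processes the results, and saves the extracted paragraphs, links, tables, and image sources to data.json.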

Conclusion

We hope that you found this web scraping tutorial helpful. Scraping public information from Wikipedia pages with Oxylabs’ Web Scraper API is a rather straightforward process. However, if you run into any additional questions about web scraping, be sure to contact us at support@oxylabs.io and our professional customer support team will happily assist you.

Frequently asked questions

Is it possible to scrape Wikipedia?

You can gather publicly available information from Wikipedia with automated solutions, such as proxies paired with a custom-built scraper or ready-made web scraping infrastructure like Oxylabs’ Web Scraper API.

Can you legally scrape Wikipedia?

When it comes to the legality of web scraping, it mostly depends on the type of data and whether it’s considered public. The data on Wikipedia articles is usually publicly available, so you should be able to scrape it. However, we always advise you to seek out professional legal assistance regarding your specific use case.

How do I extract information from Wikipedia?

To scrape public Wikipedia page data, you’ll need an automated solution like Oxylabs’ Web Scraper API or a custom-built scraper. Web Scraper API is a web scraping infrastructure that, after receiving your request, gathers publicly available Wikipedia page data according to the parameters you specified.

How do I extract text from Wikipedia in Python?

You can extract text from HTML elements by using Oxylabs’ Custom Parser feature and the {"_fn": "element_text"} function. For more information, check out the documentation. Alternatively, you can build your own parser, but it’ll take more time and resources.

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
