Fetching text from Wikipedia’s Infobox in Python

An infobox is a template used to collect and present a subset of information about its subject. It can be described as a structured document containing a set of attribute–value pairs, and in Wikipedia it represents a summary of information about the subject of an article.

So a Wikipedia infobox is a fixed-format table, usually placed in the top right-hand corner of an article, that summarizes the page and sometimes improves navigation to other related articles.

Web scraping is a technique for extracting large amounts of data from websites, where the extracted data is saved to a local file on your computer or to a database in table (spreadsheet) format.
There are several ways to extract information from the web. Using an API is one of the best ways to extract data from a website. Almost all large websites, such as YouTube, Facebook, Google, Twitter, and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, that is almost always the preferred approach over web scraping.

Sometimes we need to scrape the content of a Wikipedia page while developing a project or for use elsewhere. In this article, I'll show how to extract the contents of a Wikipedia infobox.

We can use two Python modules for scraping data:
urllib2: A Python module for fetching URLs. It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of different protocols. (In Python 3, this functionality is provided by urllib.request.) For more detail, refer to the documentation page.

BeautifulSoup: An incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract specific information from web pages. See the documentation page of BeautifulSoup for details.
BeautifulSoup does not fetch the web page for us, so we can use urllib2 together with the BeautifulSoup library.
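As a rough sketch of how the two modules divide the work, the example below fetches with urlopen and parses with BeautifulSoup. It assumes Python 3, where urlopen lives in urllib.request rather than urllib2. The helper names fetch_html and infobox_pairs and the sample markup are hypothetical, chosen only for illustration; the sample is hand-written so the sketch runs without network access.

```python
# Python 3 sketch: urllib2's urlopen now lives in urllib.request.
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download `url` and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        # Use the charset declared in the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def infobox_pairs(html):
    """Return the infobox rows of `html` as a dict (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" also matches multi-valued classes like "infobox vcard".
    table = soup.find("table", class_="infobox")
    pairs = {}
    if table is None:
        return pairs
    for row in table.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        if header is not None and value is not None:
            pairs[header.get_text(strip=True)] = value.get_text(strip=True)
    return pairs


# Hand-written sample markup standing in for a fetched page.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Motto</th><td><i>Example motto</i></td></tr>
  <tr><th>Type</th><td>Example type</td></tr>
</table>
"""

print(infobox_pairs(SAMPLE_HTML))
```

With a real page, the two steps would be chained as infobox_pairs(fetch_html(url)), where url is the address of the article you want to scrape.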

Now I am going to show you another easy way to scrape this data, using the following steps.

The modules we will be using are:

    1) lxml: lxml is one of the most feature-rich and easy-to-use libraries for processing XML and HTML in Python. (You can refer to its documentation to know more about the lxml module.)
    2) requests: Requests is an Apache2-licensed HTTP library written in Python. It allows you to send HTTP/1.1 requests from Python and to add content such as headers, form data, multipart files, and parameters through a simple Python interface. It also lets you access the response data in the same way.
    For more information on it, refer to its documentation.
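To make the point about headers and parameters concrete, here is a minimal sketch that builds a request without sending it. The URL, the title parameter, and the User-Agent string are hypothetical values chosen for illustration.

```python
import requests

# Hypothetical values chosen for illustration only.
request = requests.Request(
    "GET",
    "https://en.wikipedia.org/w/index.php",
    params={"title": "Infobox"},
    headers={"User-Agent": "infobox-demo/0.1"},
)

# prepare() builds the final request without sending it over the network.
prepared = request.prepare()
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing the same params and headers arguments to requests.get would send the request and return the response.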

I have used Python 2.7 here.

Make sure these modules are installed on your machine.
If they are not, you can install them from a console or command prompt using pip, for example with: pip install requests lxml

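# importing modules
import requests
from lxml import etree

# manually storing desired URL
# (the URL is missing from this copy of the article; put the article address here)
url = '...'
# fetching its url through requests module
req = requests.get(url)
# etree.HTML uses lxml's lenient HTML parser; etree.fromstring expects
# well-formed XML and can fail on real-world pages
store = etree.HTML(req.text)
# this will give Motto portion of above
# URL's info box of Wikipedia's page
# (//tr also matches rows nested inside a <tbody> element)
output = store.xpath('//table[@class="infobox vcard"]//tr[th/text()="Motto"]/td/i')
# printing the text portion (the parentheses work in Python 2.7 and 3)
print(output[0].text)
# Run this program on your installed Python or
# on your local system using cmd or any IDE.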

See this link; it will display the ‘Motto’ section of this Wikipedia page’s infobox (as shown in this screenshot).


Finally, running the program prints the text of the Motto entry.

You can also modify the URL and the expression passed to store.xpath to get different sections of the infobox.
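One way to vary the section is to pass the row label in as an XPath variable. The sketch below is a minimal illustration under assumptions: the helper name infobox_value and the sample markup are hypothetical, and the hand-written HTML stands in for a downloaded page so the example runs offline.

```python
from lxml import etree

# Hypothetical, hand-written markup standing in for a downloaded page.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tbody>
    <tr><th>Motto</th><td><i>Example motto</i></td></tr>
    <tr><th>Established</th><td>1900</td></tr>
  </tbody>
</table>
"""


def infobox_value(tree, label):
    """Return the text of the infobox row whose header equals `label`."""
    # The XPath variable $label avoids pasting the label into the expression.
    cells = tree.xpath(
        '//table[contains(@class, "infobox")]//tr[th/text()=$label]/td',
        label=label,
    )
    if not cells:
        return None
    # itertext() also collects text nested inside tags such as <i>.
    return "".join(cells[0].itertext()).strip()


tree = etree.HTML(SAMPLE_HTML)
print(infobox_value(tree, "Motto"))        # Example motto
print(infobox_value(tree, "Established"))  # 1900
print(infobox_value(tree, "Founder"))      # None
```

Returning None for a missing label avoids the IndexError that output[0] would raise when the XPath matches nothing.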
If you want to learn more about web scraping, see these links:
1) Web Scraping 1
2) Web Scraping 2

This article is attributed to GeeksforGeeks.org
