Scraping Premium News Articles

Using Python and BeautifulSoup4

Web scraping is the process of extracting data from websites using automated tools or scripts. It lets you gather specific information such as news articles, job listings, product prices, and reviews. Web scraping is commonly done for data analysis, research, and monitoring.

Modern-day websites can be classified into:

Static websites

These websites display fixed content for all users. The information remains constant and doesn't change unless the webmaster manually edits the HTML source code.

Dynamic websites

Dynamic websites generate content on the fly, customizing it based on user interactions or other variables. They often use server-side languages, databases, and scripting to create a more interactive and personalized experience.

Mixed (static and dynamic) websites

Mixed websites combine elements of both static and dynamic websites. They may have certain pages or sections that are static, while others are generated dynamically. This allows a balance between personalized content and efficient loading times. These mixed websites often use caching techniques to optimize performance and deliver a seamless user experience.

Ways to identify website types:

Static websites: URLs usually have clear, straightforward structures without parameters, and each page typically corresponds to a specific file or directory. The source code is relatively simple and doesn't rely on server-side technologies such as PHP, Python, or Node.js. Interactivity is limited, usually to basic hyperlinks and forms.

Dynamic websites: URLs may contain parameters or variables that reflect the dynamic nature of the content and often influence what is displayed. Signs of server-side scripting in the page source indicate that content is generated dynamically. Rich interactivity, AJAX requests, and real-time updates are common.

These indicators should help you tell whether a website is dynamic or static. Identifying mixed websites takes practice and a thorough examination of the HTML source.
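
As a rough first pass, you can also check programmatically for the indicators above: query parameters in the URL and script-heavy markup in the response. This is only a heuristic sketch (guess_site_type is a hypothetical helper, and the threshold of 20 script tags is arbitrary), not a definitive test.

from urllib.parse import urlparse, parse_qs

import requests

def guess_site_type(url):
    """Rough heuristic: query parameters and lots of <script> tags hint at dynamic content."""
    parsed = urlparse(url)
    has_params = bool(parse_qs(parsed.query))      # e.g. ?id=42&lang=en

    html = requests.get(url, timeout=10).text
    script_count = html.lower().count('<script')   # crude measure of client-side scripting

    if has_params or script_count > 20:
        return 'likely dynamic (or mixed)'
    return 'possibly static'

print(guess_site_type('https://example.com/page?id=42'))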

For scraping both dynamic and static websites, different libraries are often used. Here are some popular choices:

  • Static Websites:

    • Beautiful Soup: This Python library is excellent for parsing HTML and XML documents. It's commonly used for extracting data from static web pages.
from bs4 import BeautifulSoup
import requests

url = 'your_static_website_url'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Now you can navigate and extract information from 'soup'.
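
For instance, once the soup object from the snippet above exists, you can start pulling elements out of the parsed page (the tags queried below are generic; what you actually look for depends on the site's markup):

# A few basic navigation examples on the 'soup' object
print(soup.title.text)               # text of the page <title>
first_heading = soup.find('h1')      # first <h1> element, or None if absent
for link in soup.find_all('a'):      # every hyperlink on the page
    print(link.get('href'))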

  • Dynamic Websites:

    • Selenium: When dealing with dynamic content, especially content loaded through JavaScript, Selenium is a powerful tool. It automates a real browser, allowing you to interact with the web page as a user would.

from selenium import webdriver

url = 'your_dynamic_website_url'
driver = webdriver.Chrome()  # You need to have the appropriate WebDriver installed
driver.get(url)

# Now you can interact with the dynamically loaded content using Selenium.
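
Dynamic pages often need an explicit wait before the element you care about actually exists in the DOM. A minimal sketch, assuming the target text lives in an element with a hypothetical CSS class article-body:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the (hypothetical) article body to be rendered by JavaScript
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.article-body'))
)
print(element.text)
driver.quit()  # close the browser when done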
    • Scrapy with Splash: Scrapy is a versatile web crawling framework, and when combined with Splash (a JavaScript rendering service), it can handle dynamic content effectively.

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'
    # Your spider configuration here...

    def start_requests(self):
        url = 'your_dynamic_website_url'
        # Ask Splash to render the page and wait briefly for JavaScript to finish
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # Parse the dynamically loaded content here.
        pass
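
For the Splash route to work, a Splash instance has to be running (commonly via Docker) and the scrapy-splash middlewares must be enabled in the project settings. The sketch below shows the typical settings described in the scrapy-splash documentation; the URL and port depend on where your Splash instance runs, so treat the values as placeholders.

# settings.py (typical scrapy-splash configuration)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'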

Choose the library based on your specific requirements, and consider factors such as ease of use, community support, and the nature of the website you are scraping.

Let's dive into the actual web scraping project. For demonstration purposes, we are going to scrape a premium paywalled article from indianexpress.com using Python and the BeautifulSoup4 library. This site was chosen for the demo because it is static and simple.

I assume you have only basic knowledge of web scraping and programming and have Python 3.12.x installed.

Project Workflow:

  1. Create a project directory in Windows Explorer and open it in VS Code.

  2. Press Ctrl+Shift+P to open the command palette and create a new virtual environment by selecting "Python: Create Environment". As a convention, it is recommended to name the virtual environment folder ".venv".

  3. Once the virtual environment is created, activate it by running .venv/Scripts/activate in the VS Code terminal.

  4. Install the required libraries: pip install beautifulsoup4 requests. (A quick import check after this list confirms everything installed correctly.)

  5. Visit gitignore.io and generate a .gitignore file for your project.

  6. Copy the content of the gitignore file and paste it into a new file named .gitignore in the root directory of your project.

  7. Initialize Git: git init.

  8. Create a file in the root directory named main.py.

  9. Copy and paste the entire code below into the main.py file and save it.

  10. Run the script with python main.py. After a successful run, save the installed packages and their versions to requirements.txt in the root directory of your project by running this command in the VS Code terminal: pip freeze > requirements.txt.
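
As a quick sanity check after step 4, you can confirm the packages are importable inside the virtual environment before running the scraper:

# Verify that the scraping dependencies are available in the active environment
import bs4
import requests

print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)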

This code extracts the text from the article and saves it as an HTML file in the root directory of your project, using the article headline (the <h1> text) as the filename.

import requests
from bs4 import BeautifulSoup
import re
from html import escape

# URL of the web page
url = "https://indianexpress.com/article/explained/explained-climate/kashmir-ladakh-without-snow-why-implications-9110841/"

# Make a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Locate the main story container, then extract the headline from its <h1> tag
    section_content = soup.find('div', {'id': 'section', 'class': 'ie_single_story_container', 'data-env': 'production'})
    h1_content = section_content.find('h1').text.strip()

    pcl_full_content = soup.find('div', {'id': 'pcl-full-content', 'class': 'story_details'})

    # Check if sub_header <h2></h2> tags are found before accessing its text attribute
    h2_element = pcl_full_content.find('h2')
    h2_content = h2_element.text.strip() if h2_element else ""

    # Find all <p></p> tags and extract their text content
    p_elements = pcl_full_content.find_all('p')
    p_contents = [p.text.strip() for p in p_elements]

    # Create a valid filename by replacing invalid characters with underscores
    valid_filename = re.sub(r'[\/:*?"<>|]', '_', h1_content.lower()) + '.html'

    # Save as HTML file with the valid filename
    with open(valid_filename, 'w', encoding='utf-8') as file:
        file.write(f"<!DOCTYPE html>\n<html>\n<head>\n\t<title>{escape(h1_content)}</title>\n</head>\n<body>\n")
        file.write(f"\t<h1>{escape(h1_content)}</h1>\n")
        file.write(f"\t<h2>{escape(h2_content)}</h2>\n")
        for p_content in p_contents:
            file.write(f"\t<p>{escape(p_content)}</p>\n")
        file.write("</body>\n</html>")

    print(f"Content saved to {valid_filename}")
else:
    print("Failed to retrieve the web page. Status code:", response.status_code)