
Python: 2 Ways to Extract Plain Text from a Webpage

Last updated: May 19, 2023

The raw HTML of a webpage contains much more than text: tags, images, JavaScript, and so on. Sometimes you just need the plain text for data analytics, machine learning, or other purposes. This article shows you 2 ways to get it. Don't worry; the amount of code we need to write is small.

Using Requests and BeautifulSoup 4

This approach requires 2 packages: beautifulsoup4 and requests. Install them by running the following command:

pip install beautifulsoup4 requests

Then use them like so:

import requests
from bs4 import BeautifulSoup

# We will extract plain text from this webpage
url = "https://api.slingacademy.com/v1/examples/sample-page.html"

# Get HTML source code of the webpage
response = requests.get(url)

# Parse the source code using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the plain text content
text = soup.get_text()

# Print the plain text
print(text)

The output is long. Here’s just a small portion of it:

Sample Page


A Sample Webpage
Welcome to Sling Academy
...
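By default, get_text() keeps the text of every element, including <script> and <style>, and preserves a lot of whitespace. If the output above looks noisy, a common refinement is to remove those tags first and pass separator and strip arguments. Here is a minimal sketch using an inline HTML snippet in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# A small inline HTML sample (stands in for response.text)
html = """
<html>
  <head><title>Sample Page</title><style>body { color: red; }</style></head>
  <body>
    <h1>A Sample Webpage</h1>
    <script>console.log("noise");</script>
    <p>Welcome to Sling Academy</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove tags whose text we never want in the result
for tag in soup(["script", "style"]):
    tag.decompose()

# separator="\n" joins text nodes with newlines; strip=True trims whitespace
text = soup.get_text(separator="\n", strip=True)
print(text)
```

decompose() removes an element (and its text) from the tree entirely, so the script and style contents never show up in the extracted text.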

Using Scrapy and lxml

Install the package:

pip install Scrapy

Then implement the program as shown below:

import scrapy
from scrapy.crawler import CrawlerProcess
import lxml.etree
import lxml.html

# Create a spider
class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'LOG_LEVEL': 'CRITICAL', # Only log critical errors
        'LOG_FILE': 'log.txt', # Save log to file
    }

    # Define a list of start URLs
    start_urls = [
        "https://api.slingacademy.com/v1/examples/sample-page.html"
    ]

    def parse(self, response):
        # Parse the response
        content = lxml.html.fromstring(response.body)

        # Strip unwanted elements like <script> and <head>
        lxml.etree.strip_elements(content, lxml.etree.Comment, "script", "head")

        # complete text
        plain_text = lxml.html.tostring(content, method="text", encoding="unicode")
        print(plain_text)


# Run the spider
process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()

A part of the output:

A Sample Webpage
Welcome to Sling Academy
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed quis leo vitae lorem tincidunt ullamcorper.
Donec euismod lacus id nisi luctus aliquet. Fusce ut nisl quis augue mattis condimentum. Morbi vitae lectus
sed eros consequat malesuada.
...
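If you only need the text extraction step, the lxml part of the spider above also works on its own, without Scrapy driving the request. The sketch below applies the same strip_elements/tostring calls to an inline HTML snippet (standing in for response.body):

```python
import lxml.etree
import lxml.html

# A small inline HTML sample (stands in for the downloaded page)
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>A Sample Webpage</h1>
    <script>console.log("noise");</script>
    <p>Welcome to Sling Academy</p>
  </body>
</html>
"""

content = lxml.html.fromstring(html)

# Drop comments, <script> elements, and the whole <head> before extracting
lxml.etree.strip_elements(content, lxml.etree.Comment, "script", "head")

# method="text" serializes only the text nodes of the remaining tree
plain_text = lxml.html.tostring(content, method="text", encoding="unicode")
print(plain_text.strip())
```

This is handy when you already have the HTML in hand (from requests, a file, etc.) and Scrapy's crawling machinery would be overkill.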

That’s it. Happy coding & have a nice day!
