JavaScript: Ways to Remove CSS and Scripts from Raw HTML

Updated: February 1, 2024 By: Guest Contributor

Introduction

When developing web applications, you sometimes need to strip CSS and scripts out of raw HTML. Common reasons include security (preventing the execution of malicious scripts), data extraction, and transforming content for an environment where CSS and scripts are not needed.

Understanding the Situation

Raw HTML typically comes from one of two sources: a server response or the DOM (Document Object Model) itself. In either case, it is usually a string containing a mix of elements, text, script tags, link tags (for CSS), and style attributes and tags.

The methods of removing CSS and scripts vary depending on the situation:

  • Using browser-based JavaScript to manipulate the DOM.
  • Handling HTML strings within a Node.js environment.

Let’s examine ways to handle this in both contexts.

Browser-based JavaScript

Removing Script Tags

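// rawHTML is the HTML string to clean.
const container = document.createElement('div');
// Parse the string into a detached DOM tree.
container.innerHTML = rawHTML;
container.querySelectorAll('script').forEach(script => script.remove());

In the above example, we create an element as a container and assign the raw HTML to its innerHTML. This parses the HTML into a temporary DOM structure within the container. We then select all script tags with querySelectorAll and delete each one with the remove method. Script tags inserted through innerHTML are not executed, but inline event handlers can still fire: for example, an <img> with an onerror attribute starts loading as soon as it is parsed, even though the container is never attached to the page.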

Removing Link Tags for CSS

container.querySelectorAll('link[rel="stylesheet"]').forEach(link => link.remove());

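We follow a similar procedure to remove <link> tags that point to external CSS stylesheets. The selector above matches only an exact rel value of "stylesheet"; to also catch multi-token values such as rel="alternate stylesheet", use link[rel~="stylesheet"] instead.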

Removing Inline Styles

container.querySelectorAll('[style]').forEach(el => el.removeAttribute('style'));

This removes the style attribute from every element that has one, stripping away all inline CSS.

Removing Style Tags

container.querySelectorAll('style').forEach(style => style.remove());

Any <style> tags within the HTML are also removed using the same technique.

The processed HTML can then be extracted using container.innerHTML.
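The four steps above can be combined into one function. The following is a minimal sketch, not a definitive implementation: the function name stripScriptsAndCss is hypothetical, and it swaps the div container for a <template> element, whose content is parsed into an inert fragment so that nothing in it loads or runs while it is being cleaned.

```javascript
// Hypothetical helper: strips scripts and CSS from an HTML string in the browser.
function stripScriptsAndCss(rawHTML) {
  // A <template> parses its content into an inert fragment:
  // scripts do not run and images are not fetched.
  const template = document.createElement('template');
  template.innerHTML = rawHTML;

  // Remove <script>, <style>, and stylesheet <link> elements in one pass.
  template.content
    .querySelectorAll('script, style, link[rel="stylesheet"]')
    .forEach(el => el.remove());

  // Drop inline style attributes from the remaining elements.
  template.content
    .querySelectorAll('[style]')
    .forEach(el => el.removeAttribute('style'));

  // Serialize the cleaned fragment back to a string.
  return template.innerHTML;
}
```

Calling stripScriptsAndCss(rawHTML) returns the cleaned markup as a string, which can then be inserted into the page or processed further.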

Node.js Environment

In a Node.js environment, the browser approach doesn’t work as-is because Node.js has no built-in DOM. Instead, we use modules such as jsdom, which implements the DOM in JavaScript, or cheerio, which parses HTML into a lightweight tree with a jQuery-like API.

Using jsdom

const { JSDOM } = require('jsdom');
const dom = new JSDOM(rawHTML);
const { window } = dom;

const scriptElements = window.document.querySelectorAll('script');
scriptElements.forEach(script => script.remove());

The JSDOM constructor parses the HTML string and provides a `window` object that simulates the browser’s window. We use this to remove script elements.

Using Cheerio

const cheerio = require('cheerio');
const $ = cheerio.load(rawHTML);
$('script').remove();
$('link[rel="stylesheet"]').remove();
$('[style]').removeAttr('style');
$('style').remove();

Cheerio provides a jQuery-like API for the server. We can use familiar jQuery syntax to manipulate the loaded HTML string.

Considerations

When you’re removing scripts and styles:

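  • Removing <script> tags alone does not make untrusted HTML safe: inline event handlers (such as onclick or onerror) and javascript: URLs can still execute code. For XSS protection, pass the HTML through a dedicated sanitizer such as DOMPurify.
  • Large HTML documents can be slow and memory-hungry to parse as a whole; consider stream-based processing for them.
  • Removing scripts and styles can break the content’s behavior and layout, so check that the result still works for your purpose.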

In conclusion, JavaScript offers multiple ways to remove CSS and scripts from raw HTML, suitable for both browser-based applications and server-side processing with Node.js. The methods shared in this article serve as a foundation, and you can adapt them to fit the specific requirements of your projects.