Extract all links from a webpage using Python and Beautiful Soup

Updated: January 15, 2023 By: Goodman

This article shows you how to get all links from a webpage using Python 3, the requests module, and the Beautiful Soup 4 module. For demonstration purposes, I will scrape the main page of Wikipedia and extract its links:

https://en.wikipedia.org/wiki/Main_Page

Please note that not all websites allow you to crawl content from them.
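
One common way a site states its crawling rules is a robots.txt file. The following is a minimal sketch, assuming you already have the text of that file, of how Python's standard urllib.robotparser module can tell you whether a given URL may be fetched. The robots.txt content, the example.com URLs, the user agent name, and the helper name is_allowed are all made up for illustration so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# 'my-link-extractor' is a made-up user agent name
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-link-extractor',
                 'https://example.com/index.html'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-link-extractor',
                 'https://example.com/private/data.html'))
```

Against a live site, you would typically call the parser's set_url() and read() methods to download the real robots.txt instead of parsing a string. A robots.txt file is only one signal, so it is still worth reading the site's terms of use.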

Installing Packages

Install the required modules by running the following commands:

pip install requests

and:

pip install beautifulsoup4

If you’re using macOS or Linux, you may need to type pip3 instead of pip (or run python3 -m pip install followed by the package name).

The Code

We will proceed through the following steps:

  1. Download the HTML source from the webpage by using requests
  2. Parse the HTML and extract links using Beautiful Soup
  3. Print out the result

Here’s the code:

import requests
from urllib.parse import urljoin

# The beautifulsoup4 package is imported under the name bs4
import bs4

URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Fetch the HTML source from the URL
response = requests.get(URL)

# Parse the HTML and select every <a> element
soup = bs4.BeautifulSoup(response.text, 'html.parser')
links = soup.select('a')

# Print out the result
for link in links:
    print(link.get_text())
    href = link.get('href')
    if href is not None:
        # Convert relative URLs to absolute ones. urljoin leaves absolute
        # URLs unchanged and also handles protocol-relative (//host/path)
        # and fragment-only (#id) links.
        print(urljoin(URL, href))

    print('---')  # Just a line separator

When you run the program, you should see output like the following:

---
Jump to navigation
https://en.wikipedia.org/wiki/Main_Page#mw-head
---
Jump to search
https://en.wikipedia.org/wiki/Main_Page#searchInput
---
Wikipedia
https://en.wikipedia.org/wiki/Wikipedia
---
free
https://en.wikipedia.org/wiki/Free_content
---
encyclopedia
https://en.wikipedia.org/wiki/Encyclopedia
---
anyone can edit
https://en.wikipedia.org/wiki/Help:Introduction_to_Wikipedia
---
6,602,987
https://en.wikipedia.org/wiki/Special:Statistics
---
English
https://en.wikipedia.org/wiki/English_language
---
James Ashley was killed
https://en.wikipedia.org/wiki/Shooting_of_James_Ashley
---
St Leonards-on-Sea
https://en.wikipedia.org/wiki/St_Leonards-on-Sea
---
East Sussex
https://en.wikipedia.org/wiki/East_Sussex
---
misconduct in public office
https://en.wikipedia.org/wiki/Malfeasance_in_office
---
negligence
https://en.wikipedia.org/wiki/Negligence
---
battery
https://en.wikipedia.org/wiki/Battery_(tort)
---
heard by the House of Lords
https://en.wikipedia.org/wiki/Judicial_functions_of_the_House_of_Lords
---
Full article...
https://en.wikipedia.org/wiki/Shooting_of_James_Ashley
---
Shannon Lucid
https://en.wikipedia.org/wiki/Shannon_Lucid
---
Zork
https://en.wikipedia.org/wiki/Zork
---
Farseer trilogy
https://en.wikipedia.org/wiki/Farseer_trilogy
---
Archive
https://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_article/January_2023
---
By email
https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/
---
More featured articles
https://en.wikipedia.org/wiki/Wikipedia:Featured_articles
---
About
https://en.wikipedia.org/wiki/Wikipedia:About_Today%27s_featured_article
---

https://en.wikipedia.org/wiki/File:Brig_Gen_R_B_Bradford%27s_grave_at_Hermies_British_Cemetery.jpg
---
Roland Bradford
https://en.wikipedia.org/wiki/Roland_Bradford
---
killed as a result of active service in the First World War
https://en.wikipedia.org/wiki/List_of_generals_of_the_British_Empire_who_died_during_the_First_World_War
---
Ismail Suko
https://en.wikipedia.org/wiki/Ismail_Suko
---
Riau
https://en.wikipedia.org/wiki/Riau
---
Eastern Michigan
https://en.wikipedia.org/wiki/Eastern_Michigan_Eagles_football
---
bowl-game
https://en.wikipedia.org/wiki/Bowl_game
---
San Jose State
https://en.wikipedia.org/wiki/San_Jose_State_Spartans_football
---
2022 Famous Idaho Potato Bowl
https://en.wikipedia.org/wiki/2022_Famous_Idaho_Potato_Bowl
---
1987 California Bowl
https://en.wikipedia.org/wiki/1987_California_Bowl
---
Kimiko Hirata
https://en.wikipedia.org/wiki/Kimiko_Hirata
---
Fukushima nuclear disaster
https://en.wikipedia.org/wiki/Fukushima_nuclear_disaster
---
Cretaceous
https://en.wikipedia.org/wiki/Cretaceous
---
Carsosaurus
https://en.wikipedia.org/wiki/Carsosaurus
---
Anglo-Indian
https://en.wikipedia.org/wiki/Anglo-Indian_people
---
Kenneth Powell
https://en.wikipedia.org/wiki/Kenneth_Powell_(sprinter)
---
Karnataka
https://en.wikipedia.org/wiki/Karnataka
---
Arjuna Award
https://en.wikipedia.org/wiki/Arjuna_Award
---
Casablanca Protocol
https://en.wikipedia.org/wiki/Casablanca_Protocol
---
Arab League
https://en.wikipedia.org/wiki/Arab_League
---
Palestinian refugees
https://en.wikipedia.org/wiki/Palestinian_refugees
---
Max Wenner
https://en.wikipedia.org/wiki/Max_Wenner
---
Archive
https://en.wikipedia.org/wiki/Wikipedia:Recent_additions
---
Start a new article
https://en.wikipedia.org/wiki/Help:Your_first_article
---
Nominate an article
https://en.wikipedia.org/wiki/Template_talk:Did_you_know
---

https://en.wikipedia.org/wiki/File:Invas%C3%A3o_do_pr%C3%A9dio_do_Congresso_Nacional_(52615636677).jpg
---
Jair Bolsonaro
https://en.wikipedia.org/wiki/Jair_Bolsonaro
---
invade
https://en.wikipedia.org/wiki/2023_invasion_of_the_Brazilian_Congress
---
National Congress
https://en.wikipedia.org/wiki/National_Congress_of_Brazil
---
Supreme Federal Court
https://en.wikipedia.org/wiki/Supreme_Federal_Court
---
Palácio do Planalto
https://en.wikipedia.org/wiki/Pal%C3%A1cio_do_Planalto
---
Michael Smith
https://en.wikipedia.org/wiki/Michael_Smith_(darts_player)
---
the PDC World Darts Championship
https://en.wikipedia.org/wiki/2023_PDC_World_Darts_Championship
---
adopts the euro
https://en.wikipedia.org/wiki/Croatia_and_the_euro
---
Schengen Area
https://en.wikipedia.org/wiki/Schengen_Area
---
Pope
https://en.wikipedia.org/wiki/Pope
---
Benedict XVI
https://en.wikipedia.org/wiki/Pope_Benedict_XVI
---
dies
https://en.wikipedia.org/wiki/Death_and_funeral_of_Pope_Benedict_XVI
---
Pelé
https://en.wikipedia.org/wiki/Pel%C3%A9
---
dies
https://en.wikipedia.org/wiki/Death_and_funeral_of_Pel%C3%A9
---
Ongoing
https://en.wikipedia.org/wiki/Portal:Current_events
---
Mahsa Amini protests
https://en.wikipedia.org/wiki/Mahsa_Amini_protests
---
Peruvian protests
https://en.wikipedia.org/wiki/2022%E2%80%932023_Peruvian_political_protests
---
Russian invasion of Ukraine
https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine
---
Recent deaths
https://en.wikipedia.org/wiki/Deaths_in_2023
---
Charles Simic
https://en.wikipedia.org/wiki/Charles_Simic
---
Lisa Marie Presley
https://en.wikipedia.org/wiki/Lisa_Marie_Presley
---
Victoria Lee
https://en.wikipedia.org/wiki/Victoria_Lee
---
Siegfried Kurz
https://en.wikipedia.org/wiki/Siegfried_Kurz
---
Sinikiwe Mpofu
https://en.wikipedia.org/wiki/Sinikiwe_Mpofu
---
Yuri Manin
https://en.wikipedia.org/wiki/Yuri_Manin
---
Nominate an article
https://en.wikipedia.org/wiki/Wikipedia:In_the_news/Candidates
---
January 15
https://en.wikipedia.org/wiki/January_15
---
John Chilembwe Day
https://en.wikipedia.org/wiki/John_Chilembwe
---
World Religion Day
https://en.wikipedia.org/wiki/World_Religion_Day
---

https://en.wikipedia.org/wiki/File:Derveni-papyrus.jpg
---
1815
https://en.wikipedia.org/wiki/1815
---
War of 1812
https://en.wikipedia.org/wiki/War_of_1812
---
frigate
https://en.wikipedia.org/wiki/Frigate
---
USS President
https://en.wikipedia.org/wiki/USS_President_(1800)
---
Stephen Decatur
https://en.wikipedia.org/wiki/Stephen_Decatur
---
captured by a squadron
https://en.wikipedia.org/wiki/Capture_of_USS_President
---
1937
https://en.wikipedia.org/wiki/1937
---
Spanish Civil War
https://en.wikipedia.org/wiki/Spanish_Civil_War
---
Nationalist
https://en.wikipedia.org/wiki/Francoist_Spain
---
Republican
https://en.wikipedia.org/wiki/Second_Spanish_Republic
---
Second Battle of the Corunna Road
https://en.wikipedia.org/wiki/Second_Battle_of_the_Corunna_Road
---
1947
https://en.wikipedia.org/wiki/1947
---
Black Dahlia
https://en.wikipedia.org/wiki/Black_Dahlia
---
Leimert Park, Los Angeles
https://en.wikipedia.org/wiki/Leimert_Park,_Los_Angeles
---
1962
https://en.wikipedia.org/wiki/1962
---
Derveni papyrus
https://en.wikipedia.org/wiki/Derveni_papyrus
---
manuscript
https://en.wikipedia.org/wiki/Manuscript
---
Macedonia
https://en.wikipedia.org/wiki/Macedonia_(Greece)
---

Note that the content of Wikipedia’s main page may change over time, so it’s very likely that your output is different from mine. Another important thing to reiterate is that not all websites allow you to scrape their content.
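
As a side note on the step that turns relative links into absolute ones: the href values on a real page come in several forms, and not all of them start with a slash. The following is a minimal sketch, using urljoin from Python's standard urllib.parse module, of how each form resolves against the page URL. The sample href values mirror links like those in the output above, and the helper name to_absolute is made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Main_Page'


def to_absolute(href, page_url=PAGE_URL):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


samples = [
    '/wiki/Wikipedia',                                 # root-relative path
    '//creativecommons.org/licenses/by-sa/3.0/',       # protocol-relative
    '#mw-head',                                        # fragment on the same page
    'https://stats.wikimedia.org/#/en.wikipedia.org',  # already absolute
]

for href in samples:
    print(href, '->', to_absolute(href))
```

Resolving against the full page URL rather than a hard-coded site prefix means that protocol-relative links keep their own host, and fragment links point back to the page they came from.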