Extract all links from a webpage using Python and Beautiful Soup

Updated: January 15, 2023 By: Goodman

This article shows you how to get all links from a webpage using Python 3, the requests module, and the Beautiful Soup 4 module. For demonstration purposes, I will fetch the main page of Wikipedia and extract its links:

https://en.wikipedia.org/wiki/Main_Page

Please note that not all websites allow automated crawling of their content, so check a site’s robots.txt file and terms of use before scraping it.
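One common place where a site states its crawling rules is its robots.txt file. As a rough sketch, Python's standard urllib.robotparser module can evaluate such rules before you fetch a page. The helper name is_allowed, the user agent string, and the sample rules below are my own assumptions for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == '__main__':
    agent = 'my-link-extractor'  # hypothetical user agent name
    print(is_allowed(SAMPLE_ROBOTS_TXT, agent, 'https://example.com/wiki/Main_Page'))
    print(is_allowed(SAMPLE_ROBOTS_TXT, agent, 'https://example.com/private/page'))
```

To check a live site instead of a sample string, RobotFileParser also provides set_url() and read(), which download the file for you. Keep in mind that robots.txt is only one signal; a site's terms of use may add further restrictions.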

Installing Packages

Install the required modules by running the following commands:

pip install requests

and:

pip install beautifulsoup4

If you’re using a Mac, you may need to type pip3 instead of pip.
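If you want to confirm that both packages can be imported before moving on, a quick check like the one below may help. This is only a sketch: the helper name missing_modules is hypothetical. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    missing = missing_modules(['requests', 'bs4'])
    if missing:
        print('Missing modules:', ', '.join(missing))
    else:
        print('All required modules are installed.')
```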

The Code

We will proceed through the following steps:

  1. Download the HTML source from the webpage by using requests
  2. Parse the HTML and extract links using Beautiful Soup
  3. Print out the result

Here’s the code:

import requests

# The beautifulsoup4 package is imported under the name bs4
import bs4

URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Fetch the HTML source from the URL
response = requests.get(URL)

# Parse HTML and extract links
soup = bs4.BeautifulSoup(response.text, 'html.parser')
links = soup.select('a')

# Print out the result
for link in links:
    print(link.get_text())
    href = link.get('href')
    if href is not None:
        if href.startswith(('http://', 'https://')):
            print(href) # Already an absolute URL
        elif href.startswith('//'):
            print('https:' + href) # Protocol-relative URL: only the scheme is missing
        else:
            print('https://en.wikipedia.org' + href) # Convert relative URL to absolute URL

    print('---') # Just a line separator

When running that program, you should see something like the following:

https://creativecommons.org/licenses/by-sa/3.0/
---
Terms of Use
https://foundation.wikimedia.org/wiki/Terms_of_Use
---
Privacy Policy
https://foundation.wikimedia.org/wiki/Privacy_policy
---
Wikimedia Foundation, Inc.
https://www.wikimediafoundation.org/
---
Privacy policy
https://foundation.wikimedia.org/wiki/Privacy_policy
---
About Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:About
---
Disclaimers
https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
---
Contact Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:Contact_us
---
Mobile view
https://en.m.wikipedia.org/w/index.php?title=Main_Page&mobileaction=toggle_view_mobile
---
Developers
https://developer.wikimedia.org
---
Statistics
https://stats.wikimedia.org/#/en.wikipedia.org
---
Cookie statement
https://foundation.wikimedia.org/wiki/Cookie_statement
---

Home
---

https://www.mediawiki.org/
---

---
Jump to navigation
https://en.wikipedia.org#mw-head
---
Jump to search
https://en.wikipedia.org#searchInput
---
Wikipedia
https://en.wikipedia.org/wiki/Wikipedia
---
free
https://en.wikipedia.org/wiki/Free_content
---
encyclopedia
https://en.wikipedia.org/wiki/Encyclopedia
---
anyone can edit
https://en.wikipedia.org/wiki/Help:Introduction_to_Wikipedia
---
6,602,987
https://en.wikipedia.org/wiki/Special:Statistics
---
English
https://en.wikipedia.org/wiki/English_language
---
James Ashley was killed
https://en.wikipedia.org/wiki/Shooting_of_James_Ashley
---
St Leonards-on-Sea
https://en.wikipedia.org/wiki/St_Leonards-on-Sea
---
East Sussex
https://en.wikipedia.org/wiki/East_Sussex
---
misconduct in public office
https://en.wikipedia.org/wiki/Malfeasance_in_office
---
negligence
https://en.wikipedia.org/wiki/Negligence
---
battery
https://en.wikipedia.org/wiki/Battery_(tort)
---
heard by the House of Lords
https://en.wikipedia.org/wiki/Judicial_functions_of_the_House_of_Lords
---
Full article...
https://en.wikipedia.org/wiki/Shooting_of_James_Ashley
---
Shannon Lucid
https://en.wikipedia.org/wiki/Shannon_Lucid
---
Zork
https://en.wikipedia.org/wiki/Zork
---
Farseer trilogy
https://en.wikipedia.org/wiki/Farseer_trilogy
---
Archive
https://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_article/January_2023
---
By email
https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/
---
More featured articles
https://en.wikipedia.org/wiki/Wikipedia:Featured_articles
---
About
https://en.wikipedia.org/wiki/Wikipedia:About_Today%27s_featured_article
---

https://en.wikipedia.org/wiki/File:Brig_Gen_R_B_Bradford%27s_grave_at_Hermies_British_Cemetery.jpg
---
Roland Bradford
https://en.wikipedia.org/wiki/Roland_Bradford
---
killed as a result of active service in the First World War
https://en.wikipedia.org/wiki/List_of_generals_of_the_British_Empire_who_died_during_the_First_World_War
---
Ismail Suko
https://en.wikipedia.org/wiki/Ismail_Suko
---
Riau
https://en.wikipedia.org/wiki/Riau
---
Eastern Michigan
https://en.wikipedia.org/wiki/Eastern_Michigan_Eagles_football
---
bowl-game
https://en.wikipedia.org/wiki/Bowl_game
---
San Jose State
https://en.wikipedia.org/wiki/San_Jose_State_Spartans_football
---
2022 Famous Idaho Potato Bowl
https://en.wikipedia.org/wiki/2022_Famous_Idaho_Potato_Bowl
---
1987 California Bowl
https://en.wikipedia.org/wiki/1987_California_Bowl
---
Kimiko Hirata
https://en.wikipedia.org/wiki/Kimiko_Hirata
---
Fukushima nuclear disaster
https://en.wikipedia.org/wiki/Fukushima_nuclear_disaster
---
Cretaceous
https://en.wikipedia.org/wiki/Cretaceous
---
Carsosaurus
https://en.wikipedia.org/wiki/Carsosaurus
---
Anglo-Indian
https://en.wikipedia.org/wiki/Anglo-Indian_people
---
Kenneth Powell
https://en.wikipedia.org/wiki/Kenneth_Powell_(sprinter)
---
Karnataka
https://en.wikipedia.org/wiki/Karnataka
---
Arjuna Award
https://en.wikipedia.org/wiki/Arjuna_Award
---
Casablanca Protocol
https://en.wikipedia.org/wiki/Casablanca_Protocol
---
Arab League
https://en.wikipedia.org/wiki/Arab_League
---
Palestinian refugees
https://en.wikipedia.org/wiki/Palestinian_refugees
---
Max Wenner
https://en.wikipedia.org/wiki/Max_Wenner
---
Archive
https://en.wikipedia.org/wiki/Wikipedia:Recent_additions
---
Start a new article
https://en.wikipedia.org/wiki/Help:Your_first_article
---
Nominate an article
https://en.wikipedia.org/wiki/Template_talk:Did_you_know
---

https://en.wikipedia.org/wiki/File:Invas%C3%A3o_do_pr%C3%A9dio_do_Congresso_Nacional_(52615636677).jpg
---
Jair Bolsonaro
https://en.wikipedia.org/wiki/Jair_Bolsonaro
---
invade
https://en.wikipedia.org/wiki/2023_invasion_of_the_Brazilian_Congress
---
National Congress
https://en.wikipedia.org/wiki/National_Congress_of_Brazil
---
Supreme Federal Court
https://en.wikipedia.org/wiki/Supreme_Federal_Court
---
Palácio do Planalto
https://en.wikipedia.org/wiki/Pal%C3%A1cio_do_Planalto
---
Michael Smith
https://en.wikipedia.org/wiki/Michael_Smith_(darts_player)
---
the PDC World Darts Championship
https://en.wikipedia.org/wiki/2023_PDC_World_Darts_Championship
---
adopts the euro
https://en.wikipedia.org/wiki/Croatia_and_the_euro
---
Schengen Area
https://en.wikipedia.org/wiki/Schengen_Area
---
Pope
https://en.wikipedia.org/wiki/Pope
---
Benedict XVI
https://en.wikipedia.org/wiki/Pope_Benedict_XVI
---
dies
https://en.wikipedia.org/wiki/Death_and_funeral_of_Pope_Benedict_XVI
---
Pelé
https://en.wikipedia.org/wiki/Pel%C3%A9
---
dies
https://en.wikipedia.org/wiki/Death_and_funeral_of_Pel%C3%A9
---
Ongoing
https://en.wikipedia.org/wiki/Portal:Current_events
---
Mahsa Amini protests
https://en.wikipedia.org/wiki/Mahsa_Amini_protests
---
Peruvian protests
https://en.wikipedia.org/wiki/2022%E2%80%932023_Peruvian_political_protests
---
Russian invasion of Ukraine
https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine
---
Recent deaths
https://en.wikipedia.org/wiki/Deaths_in_2023
---
Charles Simic
https://en.wikipedia.org/wiki/Charles_Simic
---
Lisa Marie Presley
https://en.wikipedia.org/wiki/Lisa_Marie_Presley
---
Victoria Lee
https://en.wikipedia.org/wiki/Victoria_Lee
---
Siegfried Kurz
https://en.wikipedia.org/wiki/Siegfried_Kurz
---
Sinikiwe Mpofu
https://en.wikipedia.org/wiki/Sinikiwe_Mpofu
---
Yuri Manin
https://en.wikipedia.org/wiki/Yuri_Manin
---
Nominate an article
https://en.wikipedia.org/wiki/Wikipedia:In_the_news/Candidates
---
January 15
https://en.wikipedia.org/wiki/January_15
---
John Chilembwe Day
https://en.wikipedia.org/wiki/John_Chilembwe
---
World Religion Day
https://en.wikipedia.org/wiki/World_Religion_Day
---

https://en.wikipedia.org/wiki/File:Derveni-papyrus.jpg
---
1815
https://en.wikipedia.org/wiki/1815
---
War of 1812
https://en.wikipedia.org/wiki/War_of_1812
---
frigate
https://en.wikipedia.org/wiki/Frigate
---
USS President
https://en.wikipedia.org/wiki/USS_President_(1800)
---
Stephen Decatur
https://en.wikipedia.org/wiki/Stephen_Decatur
---
captured by a squadron
https://en.wikipedia.org/wiki/Capture_of_USS_President
---
1937
https://en.wikipedia.org/wiki/1937
---
Spanish Civil War
https://en.wikipedia.org/wiki/Spanish_Civil_War
---
Nationalist
https://en.wikipedia.org/wiki/Francoist_Spain
---
Republican
https://en.wikipedia.org/wiki/Second_Spanish_Republic
---
Second Battle of the Corunna Road
https://en.wikipedia.org/wiki/Second_Battle_of_the_Corunna_Road
---
1947
https://en.wikipedia.org/wiki/1947
---
Black Dahlia
https://en.wikipedia.org/wiki/Black_Dahlia
---
Leimert Park, Los Angeles
https://en.wikipedia.org/wiki/Leimert_Park,_Los_Angeles
---
1962
https://en.wikipedia.org/wiki/1962
---
Derveni papyrus
https://en.wikipedia.org/wiki/Derveni_papyrus
---
manuscript
https://en.wikipedia.org/wiki/Manuscript
---
Macedonia
https://en.wikipedia.org/wiki/Macedonia_(Greece)
---

Note that the content of Wikipedia’s main page may change over time, so it’s very likely that your output is different from mine. Another important thing to reiterate is that not all websites allow you to scrape their content.
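The script above builds absolute URLs by string concatenation. As an alternative sketch, the standard library's urllib.parse.urljoin can resolve relative, protocol-relative (//host/path), and fragment-only (#section) links against the page URL in a single call. The helper name to_absolute and the sample href values are my own assumptions for illustration. Unlike plain prefixing, urljoin resolves a fragment against the page itself rather than the site root.

```python
from urllib.parse import urljoin


def to_absolute(page_url, hrefs):
    """Resolve each href against page_url, skipping missing values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if href is None:  # <a> tags without an href attribute
            continue
        absolute = urljoin(page_url, href)
        if absolute not in seen:  # keep the first occurrence only
            seen.add(absolute)
            result.append(absolute)
    return result


PAGE_URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Hypothetical href values of the kinds found on the page
sample_hrefs = [
    '/wiki/Wikipedia:About',
    '//creativecommons.org/licenses/by-sa/3.0/',
    'https://stats.wikimedia.org/#/en.wikipedia.org',
    '#mw-head',
    None,
    '/wiki/Wikipedia:About',
]

for url in to_absolute(PAGE_URL, sample_hrefs):
    print(url)
```

With Beautiful Soup, the hrefs argument could be produced by something like [a.get('href') for a in soup.select('a')].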
