Introduction
Python’s requests module is a powerful tool for making HTTP and HTTPS requests from Python. However, server owners might need to block these requests to prevent scraping, limit bot activity, or maintain server health. In this tutorial, we’ll explore ways to detect and block requests made by this popular module.
Identifying Requests
First off, we should understand how to identify requests coming from the requests module. Typically, this can be done by checking the User-Agent header that accompanies HTTP requests. The default User-Agent for requests is something like 'python-requests/x.y.z', where x.y.z is the installed version.
def is_python_requests(request):
    # The header may be absent entirely, so fall back to an empty string
    user_agent = request.headers.get('User-Agent') or ''
    return 'python-requests' in user_agent
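To see the check in action, here is a quick sketch with a minimal stand-in request object (MockRequest is hypothetical; in practice you would receive your web framework’s request object, which exposes headers the same way):

```python
class MockRequest:
    """Minimal stand-in for a framework request object."""
    def __init__(self, headers):
        self.headers = headers

def is_python_requests(request):
    # The header may be absent entirely, so fall back to an empty string
    user_agent = request.headers.get('User-Agent') or ''
    return 'python-requests' in user_agent

print(is_python_requests(MockRequest({'User-Agent': 'python-requests/2.31.0'})))  # True
print(is_python_requests(MockRequest({'User-Agent': 'Mozilla/5.0'})))             # False
print(is_python_requests(MockRequest({})))                                        # False
```

Note that this only catches clients using the default User-Agent; anything that spoofs the header will pass.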
Blocking by User-Agent
Once you’ve identified the requests, you can block them based on the User-Agent header. On the server side, configure your web server (e.g., Nginx or Apache) to deny access to User-Agents that include ‘python-requests’.
# Example configuration snippet for Nginx
map $http_user_agent $block_bad_agents {
    default 0;
    ~*python-requests 1;
}
server {
    if ($block_bad_agents) {
        return 403;
    }
    ...
}
In Apache, you could use the mod_rewrite module to achieve a similar effect.
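For instance, a minimal mod_rewrite rule along these lines (placed in a vhost or an .htaccess file, assuming AllowOverride permits it) would return 403 Forbidden to matching clients:

```apache
RewriteEngine On
# Case-insensitive match on the User-Agent header
RewriteCond %{HTTP_USER_AGENT} python-requests [NC]
# Serve 403 Forbidden and stop processing further rules
RewriteRule .* - [F,L]
```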
Rate Limiting
A more nuanced approach is to rate-limit requests from suspicious User-Agents. Tools like fail2ban or web application firewalls (WAFs) offer ways to automatically add rate limits or temporary bans on offending IP addresses.
# Example of a fail2ban filter (e.g. filter.d/python-requests.conf),
# assuming a standard combined-format access log; fail2ban needs a
# <HOST> capture to know which IP to act on
[Definition]
failregex = ^<HOST> .*"python-requests[^"]*"
...
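A filter only matches log lines; to actually act on matches, you also pair it with a jail. A sketch along these lines (the filter name, log path, and thresholds are assumptions to adapt to your setup) would ban an IP for an hour after ten matches within a minute:

```ini
# /etc/fail2ban/jail.local
[python-requests]
enabled  = yes
filter   = python-requests
logpath  = /var/log/nginx/access.log
maxretry = 10
findtime = 60
bantime  = 3600
```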
Advanced Techniques: Behavioral Analysis
If the client changes their User-Agent to avoid detection, you may need to analyze the behavior of requests over a period and look for patterns typical of automated scripts — consistent intervals between requests, for example, or a high number of requests in a short period.
class RequestBehavior:
    def __init__(self):
        self.request_times = []

    def log_request(self, time_of_request):
        self.request_times.append(time_of_request)

    def is_suspicious(self):
        # Implement analysis of self.request_times
        ...
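As one possible heuristic for such an analysis (the thresholds and the interval-variance approach here are illustrative assumptions, not the only option), near-constant spacing between requests can be flagged as bot-like:

```python
import statistics

class RequestBehavior:
    def __init__(self):
        self.request_times = []

    def log_request(self, time_of_request):
        self.request_times.append(time_of_request)

    def is_suspicious(self, min_requests=10, max_stdev=0.05):
        # Too few data points to judge either way
        if len(self.request_times) < min_requests:
            return False
        # Gaps between consecutive requests
        intervals = [b - a for a, b in
                     zip(self.request_times, self.request_times[1:])]
        # Near-constant spacing suggests an automated client
        return statistics.stdev(intervals) < max_stdev

# A bot hitting the server exactly once per second
bot = RequestBehavior()
for t in range(20):
    bot.log_request(float(t))
print(bot.is_suspicious())  # True

# A human with irregular gaps between requests
human = RequestBehavior()
for t in [0, 1, 5, 6, 12, 13, 20, 21, 30, 31]:
    human.log_request(float(t))
print(human.is_suspicious())  # False
```

In production you would likely key one such tracker per client IP and combine this signal with others (request volume, paths visited) rather than rely on it alone.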
IP-based Blocking
Another method of blocking unwanted requests is by IP address. Firewalls and server configurations can use IP blocklists to deny ranges of addresses known to be associated with scrapers or bots.
# IP-based blocking using iptables
sudo iptables -A INPUT -s 123.45.67.89 -j DROP
Using a CAPTCHA
Implementing a CAPTCHA on your site is a common and effective way of filtering out bots. This won’t block the requests per se, but it will prevent automated access to areas of your site by requiring user interaction.
<form method="POST">
<!-- Include CAPTCHA here -->
</form>
Conclusion
In this tutorial, we explored multiple methods to detect and block requests coming from Python’s requests module. These strategies help maintain server health against unwanted scraping or automated traffic. Remember to balance the strength of these measures against the legitimate need for bot activity, such as search engine indexing.