Module 07 — Automating the Web
Type 9 · Tool-Build — build an httpx + beautifulsoup4 scraper that extracts links, finds hidden endpoints, and follows redirect chains against a bounded local target, proven by a test_scraper.py that pins endpoint discovery, the off-host scope guard, and 404 resilience. Only scrape hosts you own or are permitted to test. (Secondary: Build-&-Operate — scope and session handling that keeps the crawl safe at scale.)
Last reviewed: 2026-06
Python for Security — an unprotected endpoint is only hidden if nobody looks; a scraper looks.
In 60 seconds
A web scraper is reconnaissance, data collection, and automation in one: httpx sends the
requests, beautifulsoup4 parses the HTML, and together they cover ~80% of practical scraping.
Apps leak their structure — /admin links, /api/v2/internal comments, paths buried in JS — and
that structure is the map. The line between responsible automation and unauthorized access isn't
the tool; it's scope and authorization: identify yourself, respect robots.txt, and only touch
targets you own or are permitted to test.
Why this matters
Web scraping is a reconnaissance skill, a data-collection skill, and an automation skill in one. Security engineers scrape for exposed configuration files, forgotten admin endpoints, and leaked credentials. Automation engineers scrape to collect threat data from sites that don't offer an API. The same techniques apply to both, and the line between "responsible automation" and "unauthorized access" is authorization and scope — not the tool.
Objective
Use httpx and beautifulsoup4 to scrape a local web application: extract all links, identify
hidden or unlisted endpoints, and follow a redirect chain — and prove it with a test you wrote:
a test_scraper.py that asserts the hidden endpoint is discovered, the off-host guard holds, and a
404 doesn't crash the crawl. Building the scraper and committing a test that pins those behaviours
are equal halves — all within a clearly bounded local target, never against external hosts without
permission.
The core idea
HTTP is a text protocol. A web scraper is a program that sends HTTP requests and parses the text
responses — mechanically identical to what a browser does, minus the JavaScript execution and the
display rendering. httpx handles the HTTP; beautifulsoup4 handles the HTML parsing. The
combination covers about 80% of practical web-scraping needs.
The mental model
A scraper is a browser minus the JavaScript and the rendering — it sends HTTP and parses text. For security work, the payload isn't the page content, it's the structure: the unlinked endpoint, the commented-out API path, the script that embeds an internal route. You're reading the app's map, not its words.
The key insight for security work is that web applications leak information through their
structure: links to /admin, comments referencing /api/v2/internal, action= attributes
pointing to unlinked endpoints, JavaScript files that embed API paths. These are not
vulnerabilities in themselves, but they are the map. soup.find_all("a", href=True) gets you
every explicit link; soup.find_all(True) gets you every tag; a regex on the raw HTML source
(re.findall(r'/[\w/.-]+', response.text)) surfaces paths embedded in scripts and attributes
that BeautifulSoup doesn't expose.
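The difference between the two extraction passes is easiest to see on a small page. The sketch below is an illustration, not the lab solution: the sample HTML and the helper names `explicit_links` and `raw_paths` are assumptions made for the example. It also adds a lookbehind to the regex, because the bare pattern matches closing tags such as `</a>` as if they were paths.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page: one visible link, plus paths that appear only in a
# comment, a form action, and an inline script.
HTML = """
<html><body>
  <a href="/about">About</a>
  <!-- TODO: remove /api/v2/internal before launch -->
  <form action="/admin/login" method="post"></form>
  <script>fetch("/api/v1/users");</script>
</body></html>
"""


def explicit_links(html: str) -> set[str]:
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)}


def raw_paths(html: str) -> set[str]:
    """Return slash-prefixed tokens found anywhere in the raw source.

    The (?<!<) lookbehind is an addition for this sketch: it skips the
    slash in closing tags like </a>, which the bare pattern would match.
    """
    return set(re.findall(r"(?<!<)/[\w/.-]+", html))


links = explicit_links(HTML)
unlinked = raw_paths(HTML) - links
print("linked:  ", sorted(links))
print("unlinked:", sorted(unlinked))
```

On this page the anchor pass yields only `/about`, while the raw pass also yields `/admin/login`, `/api/v1/users`, and `/api/v2/internal`. Real pages produce more noise, so treat the regex output as candidates to check rather than confirmed endpoints.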
Session handling is the next layer: cookies, authentication tokens, and CSRF tokens are what keep
a web application's state. httpx.Client (the session object) automatically stores and resends
cookies across requests — you log in once and every subsequent request carries the session. This
is also what makes authenticated scraping work: log in with a POST to the login endpoint,
capture the session cookie, then scrape pages that require authentication.
The gotcha
Scraping a system you don't own carries the same authorization bar as a port scan — written
permission, defined scope. The most common failure is a crawler that follows a link off the target
domain; without a urlparse(url).netloc == target_host guard and a depth limit, your scoped recon
quietly becomes unauthorized access to someone else's host. For this lab, the target is a local
Flask app you run yourself — no external targets, ever.
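The guard and the depth limit can be sketched with the standard library alone. This is one possible shape, not the lab's reference crawler: `get_links` is a hypothetical callable that returns the hrefs on a page, and the `SITE` dictionary is an in-memory stand-in for the local app at an assumed address of `127.0.0.1:5000`.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def in_scope(url: str, target_host: str) -> bool:
    """True only when the URL's host and port exactly match the target."""
    return urlparse(url).netloc == target_host


def crawl(start_url, get_links, target_host, max_depth=2):
    """Breadth-first crawl that never leaves target_host or exceeds max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit: do not expand links from this page
        for href in get_links(url):
            absolute = urljoin(url, href)  # resolve relative links first
            if in_scope(absolute, target_host) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# Hypothetical site map: page URL -> hrefs found on that page.
SITE = {
    "http://127.0.0.1:5000/": ["/a", "https://example.com/offsite"],
    "http://127.0.0.1:5000/a": ["/b"],
    "http://127.0.0.1:5000/b": ["/c"],
}

visited = crawl(
    "http://127.0.0.1:5000/",
    lambda url: SITE.get(url, []),
    target_host="127.0.0.1:5000",
    max_depth=2,
)
print(visited)
```

The check runs on the resolved absolute URL, so a relative href cannot slip past it, and exact equality on `netloc` also rejects look-alikes such as `http://127.0.0.1:5000@evil.example/`. With `max_depth=2` the crawl reaches `/b` but never requests `/c`.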
Go deeper: finding what isn't linked, and session state
soup.find_all("a", href=True) gets explicit links, but a regex on the raw source
(re.findall(r'/[\w/.-]+', response.text)) surfaces paths embedded in scripts and attributes that
BeautifulSoup doesn't expose — often where the interesting endpoints hide. For authenticated
scraping, httpx.Client stores and resends cookies across requests, so a single POST to the
login endpoint carries the session through every subsequent page.
Learn (~2 hrs)
httpx sessions (~30 min)
- httpx, Client documentation: focus on the "Client Instances" and "Cookies" sections; understand how a session maintains state across requests.
BeautifulSoup4 (~1 hr)
- BeautifulSoup4, Official documentation: read the "Quick Start", "Kinds of objects", and "Searching the tree" sections; find_all() is 90% of what you'll use.
- Web Scraping with BeautifulSoup (Real Python): worked example with good coverage of common patterns; skip the "Interact with HTML Forms" section (that's for browser automation).
Responsible scraping (~30 min)
- robots.txt specification (robotstxt.org): understand what it says and what it does not say legally; short page, worth reading in full.
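The robots.txt reading translates into a small pre-flight check. The sketch below uses the standard library's `urllib.robotparser` on an inline file so that it runs offline. The rules, the `lab-scraper/0.1` identifier, and the URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the local lab target.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "lab-scraper/0.1"  # hypothetical identifier for this scraper


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch(USER_AGENT, "http://127.0.0.1:5000/index.html")
blocked = parser.can_fetch(USER_AGENT, "http://127.0.0.1:5000/private/notes")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests, if set
print(allowed, blocked, delay)
```

A crawler would typically run `can_fetch` before each request and sleep for `delay` seconds between requests, while sending the same identifier in its User-Agent header.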
Key concepts
- httpx.Client as a session: cookies and headers persist across requests
- BeautifulSoup4 parse tree: find_all(), get(), .text, .attrs
- Extracting links from HTML vs. extracting paths from raw text (regex on response.text)
- Following redirect chains with httpx (since httpx 0.20 redirects are not followed by default; pass follow_redirects=True to follow them and read response.history to inspect each hop)
- Responsible automation: User-Agent, robots.txt, rate-limiting, scope limitation
- Verify by test, not by eye: a learner-written test_scraper.py that asserts hidden-endpoint discovery, the off-host guard, and 404 resilience (the ownership half, not a diff against make demo)
AI acceleration
A model will write the scraper quickly. The missing piece is usually the scope: unless you specify
urlparse(url).netloc == target_host as a filter, the model will write code that follows links off
the target domain. Test it against a page that links to an external host: does it follow the
link? Does it have a depth limit? These are the production safety checks.
AI caveat
A model writes the scraper quickly and omits the scope guard by default — its crawl will happily walk off the target domain unless you specify the netloc filter. Test it against a URL that resolves off-host: does it follow? Does it bound depth? Those checks are what keep the crawl legal.
Check yourself
- What separates "responsible automation" from "unauthorized access" — and what is it not?
- Why scan the raw response.text with a regex when BeautifulSoup already finds the links?
- How does httpx.Client make authenticated scraping work across multiple requests?