Web Scraping: HTML with PostgreSQL XML and XPath
Web scraping can get painful when you hit rate limits, especially when you deal with a large amount of HTML source. To ease the pain, we usually dump a copy of the HTML source to local disk: it is fast, it is easy, and on day one it feels like a sweet caching solution. But soon enough your cache of scrapes outgrows a single machine, or you're running multiple spiders across different servers, and the next thing you know you're stuck with half a dozen file dumps scattered across the network. That's not fun.
1. Why Bother Storing HTML in a Database?
Most people who scrape websites keep a local copy of each page in a directory somewhere on a local or network-attached disk, like EFS on AWS. That works great if you're only running one spider on one machine. But when you scale out to multiple scrapers, multiple teams, or a distributed environment, file-based storage becomes a pain. You either have to manage some elaborate system of network shares or invent your own synchronization magic.
This is where PostgreSQL really shines: as a central repository. Your spiders can just write to a table rather than fiddle around with file system paths. It is all consistent, shared, and can be backed up as part of your normal DB routine. I feel like this approach flies under the radar, but it can radically simplify your setup.
2. The Secret Sauce: Postgres as a Central Cache
So, we want a centralized store for these HTML pages. The naive approach: just store the pages as TEXT. That's valid. In Postgres, we can have:
CREATE TABLE pages (
id SERIAL PRIMARY KEY,
url TEXT NOT NULL,
html_source TEXT
);
But wait: Postgres has something special for us, a built-in XML column type along with powerful XPath functionality. If you ever want to get fancy with searching and indexing your page content, you can do it with far more precision than with a plain text field.
3. Text vs. XML: A Surprising Trick
Why not use TEXT? After all, HTML is basically text. Well, the trick is that if your HTML is close to XML, or can be transformed into XML, you can use Postgres's built-in XML type and its associated indexing features. A little Python function can easily do the conversion from HTML to well-formed XML. Libraries like lxml handle this elegantly, even when the source contains invalid or unclosed tags. Something like:
from lxml import html

raw_html = '<html><body><h1>Hello World</h1><a href="https://www.google.com" active>Click<br> here!</body></html>'
tree = html.fromstring(raw_html)
xhtml_content = html.tostring(tree, method="xml", encoding="unicode")
# Now xhtml_content is proper XML, e.g.:
# '<html><body><h1>Hello World</h1><a href="https://www.google.com" active="">Click<br/> here!</a></body></html>'
Store xhtml_content as an XML type in Postgres:
CREATE TABLE pages (
id SERIAL PRIMARY KEY,
url TEXT NOT NULL,
html_source XML
);
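Once the Python side has produced well-formed XHTML, writing it into this table is just an INSERT; the ::xml cast makes Postgres reject anything that is not well-formed, which is exactly the guarantee we want. A minimal sketch (the URL and markup here are made up for illustration; in a real spider you would pass them as query parameters):
-- hypothetical example row; the cast to xml validates the content on the way in
INSERT INTO pages (url, html_source)
VALUES (
    'https://example.com/',
    '<html><body><h1>Hello World</h1></body></html>'::xml
);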
Now we have the data in a consistent XML format. Magically, this opens the door to Postgres's native XPath functions: xpath(), xpath_exists(), and xmltable. For more details, refer to the Postgres manual.
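As a quick taste before we dive in, here is a sketch of the latter two against the pages table above (the id and the paths are arbitrary examples, assuming the stored markup keeps its html/body structure): xpath_exists() slots nicely into a WHERE clause, and xmltable turns matched nodes into rows.
-- pages that actually contain an <h1> somewhere in the body
SELECT id, url
FROM pages
WHERE xpath_exists('/html/body//h1', html_source);

-- one row per <a> element on a given page
SELECT links.*
FROM pages,
     XMLTABLE('/html/body//a' PASSING html_source
              COLUMNS href  TEXT PATH '@href',
                      label TEXT PATH '.') AS links
WHERE pages.id = 42;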
4. Scraping, Hacking, and Querying (with XPath!)
Let’s take a look at how we can query our stored HTML-now-XML:
SELECT
xpath(
'/html/body/h1/text()',
html_source
) AS heading_text
FROM pages
WHERE id = 42;
- /html/body/h1/text() is a typical XPath expression that extracts the text from the <h1> tag.
- xpath() will return an array of XML values. Often, you'll cast them to TEXT[] if you want to handle them as strings in your queries.
That may seem like a small difference from a simple LIKE '%some heading%', but XPath expressions let you zero in on specific elements. If you're writing a screen scraper, that's extremely handy, because you can skip all the fluff around the data you really want to extract.
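Here is what that cast looks like in practice, a small sketch reusing the id = 42 row from above (the column aliases are mine):
SELECT (xpath('/html/body/h1/text()', html_source))::text[] AS headings,
       (xpath('/html/body/h1/text()', html_source))[1]::text AS first_heading
FROM pages
WHERE id = 42;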
5. Indexing for Speed: The GIN Trick
And here’s the cherry on top: you can index your XPath queries for faster lookups. Suppose you want to search for specific <project_id> nodes inside some chunk of stored markup. One approach is to use the fact that xpath() returns an array of values that you can cast to text. This example is adapted from real usage patterns:
CREATE TABLE candidates (
id SERIAL PRIMARY KEY,
assignments XML
);
Insert a bunch of data:
INSERT INTO candidates (assignments)
VALUES (
'<assignments>
<assignment><project_id>10000042</project_id></assignment>
<assignment><project_id>20000042</project_id></assignment>
</assignments>'
);
Now create a GIN index on the array of project_ids extracted by XPath:
CREATE INDEX candidates_xpath_idx
ON candidates
USING GIN (
(xpath('/assignments/assignment/project_id/text()', assignments)::text[])
);
If we want all rows that contain, say, 10000042 as a <project_id>, we can query:
SELECT *
FROM candidates
WHERE xpath('/assignments/assignment/project_id/text()', assignments)::text[]
@> ARRAY['10000042'];
- The operator @> means “contains” for arrays.
- Postgres can use our GIN index here, making the lookup extremely efficient, even at massive scale.
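You can check that the planner actually uses the index with EXPLAIN; on a table with enough rows you should see a bitmap scan on candidates_xpath_idx rather than a sequential scan (the exact plan is a sketch and depends on table size and statistics):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM candidates
WHERE xpath('/assignments/assignment/project_id/text()', assignments)::text[]
  @> ARRAY['10000042'];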
If you need to check multiple IDs at once, you can do:
-- returns rows containing BOTH 10000042 AND 20000042
SELECT *
FROM candidates
WHERE xpath('/assignments/assignment/project_id/text()', assignments)::text[]
@> ARRAY['10000042', '20000042'];
Or use the overlap operator && to find rows that have at least one of the given project IDs:
-- returns rows that contain ANY of the listed IDs
SELECT *
FROM candidates
WHERE xpath('/assignments/assignment/project_id/text()', assignments)::text[]
&& ARRAY['10000042', '20000043'];
Performance Tip: GIN indexes can be large and more expensive to maintain on writes compared to a simple B-Tree. This is a trade-off: if you frequently update the data, be aware of the overhead.
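If your lookups only ever target one well-known value per row, say the first project_id of each document, a plain B-tree expression index on that single extracted value is a cheaper-to-maintain alternative. This is a sketch of that different access pattern; note it does not support the @> and && array queries above:
-- B-tree on the first project_id only; useful for plain equality lookups
CREATE INDEX candidates_first_project_idx
ON candidates (
    ((xpath('/assignments/assignment/project_id/text()', assignments))[1]::text)
);

SELECT *
FROM candidates
WHERE (xpath('/assignments/assignment/project_id/text()', assignments))[1]::text
      = '10000042';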
6. Final Thoughts
I love being able to point clients towards these lesser-known little hacks. Sure, storing your scraped HTML as TEXT to save yourself some time works, but look at the pay-off of storing well-formed XML:
- You can elegantly filter your data with XPath.
- You’ll get to leverage some powerful indexing for super fast queries.
- You no longer have to poke around half a dozen local disks or worry about concurrency from multiple spiders.
Of course, not every project needs this complexity. If your scraping is ephemeral or you rarely need to retrieve old HTML, local file storage might be enough. But when you see your application requiring repeated lookups, partial extraction, or advanced filtering, it’s worth exploring the harmony between PostgreSQL, scraping, HTML caching, and a few hacks with XPath.
The official PostgreSQL documentation is a great next step in your journey. It includes details on the xmltable and xpath_exists() functions, as well as ways to declare namespaces for your XPath expressions and turn messy HTML into queryable goodness.
If you ever find yourself hitting performance bottlenecks or needing to slice and dice stored HTML in creative ways, remember that Postgres has your back. Feel free to reach out if you want deeper insights — I’m always up for a chat about making data wrangling more elegant and more efficient!
Happy scraping and querying!