Home Assistant Integration

Last updated: 18 June, 2026 (10 min read)

Data flow: PageCrawl monitors web pages, detects changes, updates a Home Assistant sensor per monitor, which triggers your dashboards and automations

The PageCrawl Home Assistant integration turns each of your page monitors into native Home Assistant entities you can view on your dashboard, chart over time, and build automations on. Entities update in real time whenever PageCrawl detects a change.

Note: Works on free PageCrawl accounts. You connect with OAuth (one click, no API token to create or paste). Your plan's monitor and check limits still apply.

Great Use Cases (what `rest` and `scrape` sensors can't do)

Home Assistant already fetches URLs and scrapes CSS selectors on simple, static pages. PageCrawl is for the pages those built-in sensors fall short on:

Sites that block ordinary scrapers. A rest sensor pointed at a concert resale page, an airline fare, or most big retailers comes back with an error page or an "are you human?" challenge instead of the value. PageCrawl reads these reliably, so a ticket price, a flight fare, or a product page keeps working as a sensor.
Pages behind a login, including ones that send a one-time code. Your energy tariff, broadband usage, council tax balance, or parcel status only shows once you are signed in, and some of those email a one-time passcode (OTP) every time. PageCrawl signs in for you, completes the OTP step, and surfaces the value. A scrape sensor just lands on the login screen.
JavaScript-rendered pages. A GPU or console restock counter, a live delivery-slot grid, an EV charger's status, or a price that is injected by the page's scripts is simply not in the HTML a plain fetch downloads, so scrape returns nothing. PageCrawl loads the page fully first, then reads the rendered value.
Pages that need steps first. Your council's next bin-collection date appears only after you type a postcode and submit the form. A GP or DVSA test slot appears only after you pick a location. PageCrawl performs those clicks and form fills once you configure them, and Home Assistant reads the result.
AI extraction instead of brittle selectors. Ask in plain language for "the next bin collection date" or "is this product in stock", and it keeps working when the page is redesigned and the CSS selector you would have written breaks. No div:nth-child(7) > span to re-find every time the site changes.
Change detection without false positives. A raw scrape of a planning-application page or a terms-of-service page fires on every rotating banner, view counter, or reordered block. PageCrawl filters that noise out, so you are alerted only when the date, the price, or the actual text changes, with a human-readable summary of what changed.
Visual change detection. Know when a status page, a webcam still, or a product image actually looks different, backed by screenshots, even when there is no clean text value to scrape at all.

When to Use rest/scrape vs PageCrawl

Home Assistant's built-in rest and scrape sensors are a great fit for many pages, and they run entirely locally. Reach for PageCrawl only when they fall short:

Use a built-in `rest` / `scrape` sensor when	Use PageCrawl when
The page is static, public HTML or a JSON API	The page needs JavaScript to render the value
A stable CSS selector or JSON path exists	No reliable selector, or it breaks when the page changes
The page has no login and allows automated requests	The page needs a login or blocks ordinary scrapers
The value is visible on first load	The value only appears after a login, click, or form submission
You only need the current value	You want change history, diffs, or a human or AI summary of what changed
Any change to the value is meaningful	The page is noisy (ads, timestamps, reordered blocks) and you only want real changes
You are happy maintaining the selector yourself	You want AI extraction and no scraping logic to maintain

If a scrape sensor already returns the value you need, keep using it. Bring in PageCrawl for the pages where it comes back empty, gets blocked, or needs constant selector fixes.
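
For comparison, the left-hand column of the table can be handled with a built-in sensor along the lines of the sketch below. It uses Home Assistant's standard scrape and rest integrations in configuration.yaml. The URLs, sensor names, CSS selector, JSON key, and unit are placeholders chosen for illustration, not real endpoints.

```yaml
# configuration.yaml (sketch; every URL, name, selector, and key is a placeholder)

# Static, public HTML with a stable CSS selector: the built-in scrape sensor is enough.
scrape:
  - resource: https://example.com/status
    scan_interval: 3600            # seconds between fetches
    sensor:
      - name: Example page status
        select: "#status"          # hypothetical selector for the element holding the value

# Public JSON API: the built-in rest sensor reads the value by JSON path.
rest:
  - resource: https://example.com/api/price.json
    scan_interval: 3600
    sensor:
      - name: Example API price
        value_template: "{{ value_json.price }}"   # hypothetical JSON key
        unit_of_measurement: "USD"
```

If a sensor like this starts returning unknown, an error page, or a login screen, that page is a candidate for a PageCrawl monitor instead.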

What You Get

A PageCrawl monitor as a Home Assistant device, showing its sensors, a Check now control, and diagnostic entities for last checked, last change date, and status

One Home Assistant device per monitor.
One entity per tracked element on that monitor, typed correctly (numeric, on/off, text, or item counts).
Real-time push updates when a change is detected, with polling as a fallback.
A Check now button on every monitor, plus actions to create new monitors.
A pagecrawl_change event you can trigger automations from.
A choice of what to import (everything, selected folders, or selected monitors), with support for multiple workspaces.
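
As a rough illustration of how these entities might be surfaced, the sketch below is a standard Home Assistant entities card for one monitor. All entity IDs are hypothetical (the real ones are listed on the monitor's device page), and it assumes the Check now control is exposed as a button entity.

```yaml
# Dashboard card (sketch). Entity IDs are hypothetical; copy the real ones
# from the monitor's device page under Settings > Devices & Services.
type: entities
title: Example product monitor
entities:
  - entity: sensor.example_product_price                 # tracked price element
    name: Price
  - entity: binary_sensor.example_product_availability   # tracked availability element
    name: In stock
  - entity: button.example_product_check_now             # assumed to be a button entity
    name: Check now
```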

Installation (HACS Custom Repository)

The integration installs through HACS as a custom repository.

In Home Assistant, open HACS.
Open the menu (top right) and choose Custom repositories.
Add the repository URL https://github.com/pagecrawl/hass-pagecrawl with category Integration, then add it.
Find PageCrawl in HACS, install it, and restart Home Assistant.

Connecting Your Account

Go to Settings > Devices & Services > Add Integration and search for PageCrawl.
You are redirected to PageCrawl to sign in and authorize Home Assistant. There is no token to paste, and a free account is enough.
If your account has more than one workspace, pick the one to add. Each workspace becomes its own entry with its own devices and entities. To add another workspace later, run Add Integration again and choose a different one.

Choosing What to Import

During setup you pick how much of the workspace to bring into Home Assistant:

All monitors (default): every monitor in the workspace becomes a device.
Selected folders: only monitors in the folders you choose are imported.
Selected monitors: you hand-pick the exact monitors to import.

You can change this later in the integration's Configure screen. If you narrow the selection, the devices and entities for the de-selected monitors are removed automatically. Widening it again imports the newly in-scope monitors on the next update.

Real-Time Updates vs Polling

The update mode is set in the integration's Configure screen.

Auto (default): uses push when Home Assistant has a reachable URL, otherwise falls back to polling. The integration tells you which mode is active.
Push and poll: forces push, with a slow reconciliation poll to catch any missed deliveries. Needs a reachable URL.
Polling only: never registers a webhook and checks on the interval you set. Use this for local-only installs that cannot expose an endpoint.

Push needs a URL that PageCrawl can reach from the internet. A Home Assistant Cloud (Nabu Casa) cloudhook is the recommended way to get one, and it is configured automatically. If no reachable URL is available, the integration falls back to polling. The poll interval has a 60-second minimum to respect rate limits.

How Monitors Map to Entities

Each monitor becomes a device, and each tracked element becomes one entity chosen by its type:

Element type	Entity	State
Price	sensor (monetary)	numeric value
Number	sensor (measurement)	numeric value
Rating	sensor (measurement)	numeric value
Reviews	sensor (measurement)	numeric value
HTTP status	sensor	numeric status code
Boolean	binary sensor	on when truthy
Availability	binary sensor	on when in stock
Text, Full Page, HTML, AI Extract, and similar	sensor	text value (full value in an attribute when truncated)
Links, Feed, Leaderboard, and other lists	sensor	item count (items in an attribute)

Every monitor also gets diagnostic entities (status, last checked, last change date, change percent), so a device is never empty even if its element types are unrecognized. Common details such as the URL, status, change percent, and diff and screenshot links are exposed as attributes on the primary sensor.

Each monitor also gets a few per-monitor sensors that describe its latest change:

Last change: a short, human-readable summary of what changed at the last check (the full text is available in an attribute).
AI summary: the AI summary of the latest change. It appears only when AI analysis is enabled on that monitor.
AI priority: a diagnostic score for how important the latest change is. It appears only when AI analysis is enabled on that monitor.
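
Because the entities are typed, ordinary Home Assistant triggers work on them directly. The sketch below shows two automations, one on a numeric sensor and one on a binary sensor. The entity IDs and the threshold of 50 are hypothetical, and the trigger and action syntax follows the style of the other examples on this page.

```yaml
# automations.yaml (sketch; entity IDs and the threshold are hypothetical)

# Numeric element (Price, Number, Rating, ...): fire when the value crosses a threshold.
- alias: Notify when tracked price drops below 50
  trigger:
    - platform: numeric_state
      entity_id: sensor.example_product_price
      below: 50
  action:
    - service: notify.notify
      data:
        message: "Price is now {{ states('sensor.example_product_price') }}"

# Boolean or Availability element: fire when the binary sensor turns on.
- alias: Notify when item is back in stock
  trigger:
    - platform: state
      entity_id: binary_sensor.example_product_availability
      from: "off"
      to: "on"
  action:
    - service: notify.notify
      data:
        message: "Tracked item is back in stock"
```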

Actions

The integration provides two actions you can call from automations, scripts, or the Developer Tools:

Check now (pagecrawl.check_now): trigger an immediate check of one or more monitors, then refresh their entities. Target any entity or device that belongs to the monitor, or name the monitor directly by slug or monitor_id:

service: pagecrawl.check_now
data:
  slug: openai-about
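
The same action can also target an entity rather than a slug. The sketch below runs a check every morning from a time trigger; the entity ID and the time are hypothetical, and your plan's check limits still apply to checks triggered this way.

```yaml
# Sketch: scheduled check of one monitor, targeted by entity.
alias: Check a monitor every morning
trigger:
  - platform: time
    at: "07:00:00"
action:
  - service: pagecrawl.check_now
    target:
      entity_id: sensor.example_product_price   # hypothetical; any entity of the monitor
```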

Track a new page (pagecrawl.track_page): create a new monitor from a URL, name, and tracking mode (for example price or ai_extract). Its device and entities appear after the next refresh. If you have more than one workspace, also specify the integration entry for the workspace the monitor should be created in.
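
A call might look roughly like the sketch below. The field names url, name, and mode are assumptions made for illustration, since this page does not list the action's exact schema; check the action in Developer Tools for the real field names before using it.

```yaml
# Sketch only: the field names below are assumed, not confirmed by this page.
service: pagecrawl.track_page
data:
  url: https://example.com/product/123   # assumed field name; placeholder URL
  name: Example product                  # assumed field name
  mode: price                            # assumed field name; price and ai_extract are the modes named above
```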

Automations

The integration fires a pagecrawl_change event whenever a monitor's latest change advances, so you can react to it in automations.

alias: Notify on PageCrawl change
trigger:
  - platform: event
    event_type: pagecrawl_change
action:
  - service: notify.notify
    data:
      title: "PageCrawl: {{ trigger.event.data.name }}"
      message: >-
        {{ trigger.event.data.human_difference }}
        {{ trigger.event.data.diff_url }}

Event data includes the monitor name, URL, slug, status, the change contents and difference, a human-readable summary, a diff link, and a timestamp. When AI analysis is enabled on the monitor, the event also carries the AI summary and an AI priority score, so you can filter and route changes straight from the event without looking up a per-monitor sensor.
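
To react to a single monitor rather than every change, the event trigger can be narrowed with event_data. The sketch below assumes the slug is carried in the event under the key slug; the name and human_difference keys are the ones used in the example above.

```yaml
# Sketch: only fire for one monitor, matched by slug.
alias: Notify on change to one monitor
trigger:
  - platform: event
    event_type: pagecrawl_change
    event_data:
      slug: openai-about   # assumes the event exposes the slug under the key "slug"
action:
  - service: notify.notify
    data:
      title: "PageCrawl: {{ trigger.event.data.name }}"
      message: "{{ trigger.event.data.human_difference }}"
```

To confirm the exact keys your events carry, listen for pagecrawl_change in Developer Tools > Events and trigger a check.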

Editing and Removing Monitors

The integration can create, read, and check monitors, but you edit and delete them in the PageCrawl web app. Changes you make there are reflected in Home Assistant on the next update.

Webhook Integration explains the underlying change payloads, which power the real-time push updates.
See the developer documentation for the full API and webhook reference.
