Page Content & AI
In the rapidly evolving landscape of artificial intelligence, search engines are no longer just for human eyes. AI agents, such as Anthropic’s Claude, now search the web in real time to ground their answers with live facts. And behind Claude’s web-search tool sits a crucial player: Brave Search.
Unlike many “private” search engines that lease their index from Google or Bing, Brave Search maintains its own independent web index. That index grew out of the work of the team behind the German search engine Cliqz: in 2021 Brave acquired Tailcat, the search engine that team had been building, and with it Cliqz’s “Human Web” technology.
Today, Brave’s index is fed by a unique, crowdsourced system called the Web Discovery Project (WDP). While Brave’s search backend is proprietary, the WDP browser extension code is fully open-source. We audited the WDP codebase to decode exactly how Brave Search discovers, filters, and indexes pages—and what this means for SEO and AI optimizations.
1. Client-Side Aggregation: De-linking the User Session
The core difference between Brave Search and legacy engines lies in data collection. Google and Bing rely on server-side aggregation. They collect browsing history, search clicks, and location data, linking them to a User ID (UID) to form a persistent profile (user session) across devices.
Brave WDP uses client-side aggregation:
- The browser client gathers telemetry data locally and in real time.
- Instead of sending a continuous stream of URLs associated with a user identifier, the client performs the analysis locally on the user’s machine.
- When specific heuristics are met (e.g., a search query fails to satisfy a user, causing them to search elsewhere), the client sends a single, isolated message to Brave’s servers.
- Because Brave’s servers hold no persistent UIDs, they have no key with which to group these signals into a user session or history.
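The flow in the list above can be sketched roughly as follows. This is an illustration of the idea, not WDP's code: the class name `LocalAggregator`, the `shouldReport` heuristic, and the message shape are all hypothetical.

```javascript
// Hypothetical sketch of client-side aggregation; none of these names come from WDP.
class LocalAggregator {
  constructor(send) {
    this.send = send;   // callback that delivers one message to the server
    this.visits = [];   // browsing events stay in local memory only
  }

  // Record an event locally. Nothing leaves the machine unless the heuristic fires.
  record(event) {
    this.visits.push(event);
    if (this.shouldReport()) {
      this.send(this.buildMessage());
      this.visits = []; // drop local state once the isolated message is sent
    }
  }

  // Assumed heuristic: the same query was issued on two different engines in a row.
  shouldReport() {
    const n = this.visits.length;
    if (n < 2) return false;
    const [a, b] = [this.visits[n - 2], this.visits[n - 1]];
    return a.type === "search" && b.type === "search" &&
           a.query === b.query && a.engine !== b.engine;
  }

  // The message carries the signal only: no user ID, no session ID, no history.
  buildMessage() {
    const prev = this.visits[this.visits.length - 2];
    const last = this.visits[this.visits.length - 1];
    return { action: "query-retry", query: last.query, from: prev.engine, to: last.engine };
  }
}

const sent = [];
const agg = new LocalAggregator((msg) => sent.push(msg));
agg.record({ type: "search", engine: "brave", query: "danish rye bread recipe" });
agg.record({ type: "search", engine: "google", query: "danish rye bread recipe" });
console.log(sent.length); // 1
```

The point of the design is that the server only ever sees the output of `buildMessage`, which contains nothing that ties one message to the next.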
2. Code Breakdown: The Double Fetch and URL Sanitization
To prevent indexing private documents, logged-in states, or sensitive dashboards, WDP implements strict filtering checks. In the codebase, this is handled through a process called Double Fetch and deep regex-based URL sanitization.
The Double Fetch Algorithm
In web-discovery-project.es, the function validDoubleFetch(struct_bef, struct_aft) is responsible for comparing page states:
- struct_bef (User State): When a user visits a page, the client records its structural metadata: number of password inputs (nip), forms (nf), frames (nif), text length (lt), and page title (t).
- struct_aft (Anonymous State): The browser fires a secondary, background HTTP request (the Double Fetch) using the fetch API without credentials or cookies.
- The Comparison: If the signatures differ significantly, the page is flagged as private and discarded. WDP detects whether the anonymous fetch redirects to a login wall by checking if the anonymous state suddenly has password inputs or new forms:

    if (struct_bef["nip"] == 0 && struct_aft["nip"] > 0) {
      logger.debug("validDoubleFetch: fail nip");
      return false; // Discarded as private
    }

- Jaccard Title Overlap: WDP splits the titles of both fetches into token arrays and calculates a Jaccard overlap coefficient. If the titles mismatch (a Jaccard score of 0.5 or lower for short titles, or 0.8 or lower for long titles), the page is marked private:

    jc = WebDiscoveryProject.auxIntersection(vt1, vt2).length /
         WebDiscoveryProject.auxUnion(vt1, vt2).length;
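To make the comparison concrete, here is a minimal standalone sketch of the two checks described above. It is not the WDP implementation: the field names `nip`, `nf`, and `t` follow the article, but the whitespace tokenizer, the helper names, and the `SHORT_TITLE_MAX_TOKENS` cutoff between "short" and "long" titles are assumptions.

```javascript
// Sketch only: field names (nip, nf, t) follow the article; everything else is assumed.
const SHORT_TITLE_MAX_TOKENS = 6; // hypothetical cutoff between "short" and "long" titles

function tokenize(title) {
  return title.toLowerCase().split(/\s+/).filter(Boolean);
}

// Jaccard coefficient: size of the intersection divided by size of the union.
function jaccard(tokensA, tokensB) {
  const a = new Set(tokensA);
  const b = new Set(tokensB);
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 1; // two empty titles count as identical
  let shared = 0;
  for (const tok of a) if (b.has(tok)) shared += 1;
  return shared / union.size;
}

function looksPublic(before, after) {
  // A password field or form that appears only in the anonymous fetch suggests a login wall.
  if (before.nip === 0 && after.nip > 0) return false;
  if (before.nf === 0 && after.nf > 0) return false;

  const t1 = tokenize(before.t);
  const t2 = tokenize(after.t);
  const isShort = Math.max(t1.length, t2.length) <= SHORT_TITLE_MAX_TOKENS;
  const threshold = isShort ? 0.5 : 0.8;
  return jaccard(t1, t2) > threshold; // at or below the threshold counts as a mismatch
}

const intranet = looksPublic(
  { nip: 0, nf: 1, t: "Quarterly report" },
  { nip: 1, nf: 1, t: "Sign in" }
);
console.log(intranet); // false: the anonymous fetch hit a password field

const article = looksPublic(
  { nip: 0, nf: 0, t: "How to bake Danish rye bread" },
  { nip: 0, nf: 0, t: "How to bake Danish rye bread" }
);
console.log(article); // true: same structure, identical title
```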
Heuristic URL Filtering
In sanitizer.js, the function sanitizeUrl(url) filters out telemetry leaks:
- Credentials: Rejects URLs containing inline usernames or passwords (parsedUrl.username, parsedUrl.password).
- Port Checks: Discards URLs using non-standard ports (allowing only 80 and 443).
- Local Networks (isLocalURL): Resolves IP addresses and blocks local ranges (e.g., 127.0.0.1, localhost, or RFC 1918 private subnets) to prevent crawling internal corporate wikis.
- Risky Path Parts: A static list of strings (RISKY_URL_PATH_PARTS) immediately triggers URL truncation or deletion. This blocks directories like /admin, /checkout, /wp-admin, /forgot-password, /logout, /session, and /token.
- Markov Chain Hash Detection: To strip tracking parameters and unique identifiers, a Markov Chain classifier (probHashLogM) evaluates path segments and search query parameters. If a segment has high entropy and looks like a hash key, it is blocked.
- Local Bloom Filters: Flagged private URLs are MD5-hashed and stored in a local Bloom Filter on the user’s device. This provides plausible deniability: there is no plain-text database of private browsing history stored locally.
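The first four checks in the list above can be approximated with the standard `URL` class. The sketch below is a simplification, not the code in sanitizer.js: `isSafeToReport`, `isPrivateHost`, and both constants are hypothetical names, the path list contains only the examples named above, and the function rejects a risky URL outright rather than truncating it. It also checks literal IPv4 addresses only and does no DNS resolution, and it leaves out the Markov chain and Bloom filter steps.

```javascript
// Sketch only: names and constants below are hypothetical, not taken from WDP.
const RISKY_PATH_PARTS = [
  "/admin", "/checkout", "/wp-admin", "/forgot-password", "/logout", "/session", "/token",
];
// URL.port is "" when the URL uses the default port for its scheme.
const ALLOWED_PORTS = new Set(["", "80", "443"]);

// Loopback plus the RFC 1918 ranges: 10/8, 172.16/12, 192.168/16.
function isPrivateHost(hostname) {
  if (hostname === "localhost") return true;
  const m = hostname.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 127 || a === 10 ||
         (a === 172 && b >= 16 && b <= 31) ||
         (a === 192 && b === 168);
}

function isSafeToReport(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparsable input is never reported
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (url.username || url.password) return false;   // inline credentials
  if (!ALLOWED_PORTS.has(url.port)) return false;    // non-standard port
  if (isPrivateHost(url.hostname)) return false;     // local network
  const path = url.pathname.toLowerCase();
  return !RISKY_PATH_PARTS.some((part) => path.includes(part));
}

console.log(isSafeToReport("https://example.com/blog/rye-bread"));       // true
console.log(isSafeToReport("https://user:secret@example.com/"));         // false
console.log(isSafeToReport("http://192.168.1.10/wiki"));                 // false
console.log(isSafeToReport("https://example.com:8443/docs"));            // false
console.log(isSafeToReport("https://example.com/wp-admin/options.php")); // false
```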
3. The Quorum Problem (k) and the Search Referral Bypass
To protect low-traffic or capability URLs (Google Docs share links, Dropbox files, etc.), WDP implements the STAR protocol.
A visited page URL is encrypted locally using an AES key derived from the URL itself, and a cryptographic “share” is sent to the server. The server can only decrypt the payload and learn about the URL if at least (k) different users (with different IP prefixes) have visited that exact same URL. This is the Quorum Check.
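The quorum rule itself, separate from the cryptography, can be sketched as a counter of distinct IP prefixes per URL. This deliberately does not implement STAR: in the scheme described above the server cannot read anything until the threshold is met, whereas this toy version simply counts opaque keys. The class name and the value of k in the usage lines are assumptions.

```javascript
// Sketch of the threshold logic only; no encryption or secret sharing is implemented.
class QuorumCounter {
  constructor(k) {
    this.k = k;
    this.seen = new Map(); // urlKey -> Set of distinct IP prefixes
  }

  // Returns true once at least k distinct prefixes have reported the same key.
  report(urlKey, ipPrefix) {
    if (!this.seen.has(urlKey)) this.seen.set(urlKey, new Set());
    const prefixes = this.seen.get(urlKey);
    prefixes.add(ipPrefix); // repeat reports from one prefix do not count twice
    return prefixes.size >= this.k;
  }
}

const quorum = new QuorumCounter(3); // k = 3 is an arbitrary example value
console.log(quorum.report("url-hash-1", "203.0.113"));  // false (1 of 3)
console.log(quorum.report("url-hash-1", "203.0.113"));  // false (still 1 of 3)
console.log(quorum.report("url-hash-1", "198.51.100")); // false (2 of 3)
console.log(quorum.report("url-hash-1", "192.0.2"));    // true  (3 of 3)
```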
The Challenge for Small Websites
For small, niche, or local websites (especially in smaller markets like Denmark), getting (k) unique visitors (using Brave Browser with WDP enabled) to a deep subpage can take months, which delays indexing.
The Code-Level Shortcut: Search Referral (qr)
WDP includes a built-in bypass for this constraint. If a user lands on your page via a search engine result page (Google, Bing, DuckDuckGo, Yahoo, etc.), the client appends a qr (query referral) tag:
"qr": {
  "q": "brave search open source",
  "t": "go",
  "d": 1
}
When the qr field is present, WDP completely bypasses the quorum check. Brave’s system reasons that if a page was discovered via an active search engine query and a subsequent click, it is by definition a public, safe-to-index page.
- Actionable SEO Hack: To get a new subpage indexed by Brave Search quickly, you don’t need a crowd of visitors. You just need a few users to find and click your link from a search engine result.
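As a rough illustration of that decision, the sketch below attaches a qr object when a search referral is known and skips the quorum requirement when it is present. Only the qr field and its q, t, and d keys come from the example above; `buildPageMessage`, `needsQuorum`, and the rest of the message shape are hypothetical.

```javascript
// Hypothetical helpers; only the "qr" field and its keys come from the article's example.
function buildPageMessage(url, referral) {
  const msg = { url };
  // Attach the referral exactly as observed, if the visit came from a results page.
  if (referral) msg.qr = { q: referral.q, t: referral.t, d: referral.d };
  return msg;
}

// A page reached through a search click is treated as public, so no quorum is needed.
function needsQuorum(msg) {
  return !("qr" in msg);
}

const direct = buildPageMessage("https://example.com/deep/page", null);
const viaSearch = buildPageMessage("https://example.com/deep/page", {
  q: "brave search open source",
  t: "go",
  d: 1,
});

console.log(needsQuorum(direct));    // true
console.log(needsQuorum(viaSearch)); // false
```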
4. Brave Search SEO and AI Ranking Signals
If you want your website to rank high in Brave Search—and consequently be cited by AI engines like Claude and Brave’s Answer with AI—you must optimize for the following signals:
1. Optimize for Real User Engagement
Google uses a wide web of device telemetry, but Brave WDP collects specific, isolated ad-free metrics:
- payload.a: Active engagement time (how long the tab is active and visible).
- payload.e.mm: Mouse movements (proof of active human reading).
- payload.e.sc: Scrolling events (proof of reading down the page).
Optimize your content layout. Make your articles easy to read, engaging, and interactive so users naturally scroll and spend time on your page.
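For intuition about what such counters look like, here is a small tracker that produces an object with the same three keys. It is a guess at the general shape, not WDP's collector: the class, its method names, and the choice of whole seconds as the unit for `a` are assumptions. In a browser, the methods would be wired to `visibilitychange`, `mousemove`, and `scroll` listeners.

```javascript
// Hypothetical tracker; only the output keys a, e.mm and e.sc come from the article.
class EngagementTracker {
  constructor() {
    this.activeMs = 0;
    this.visibleSince = null; // timestamp in ms while the tab is visible, else null
    this.mm = 0;
    this.sc = 0;
  }

  onVisible(nowMs) {
    if (this.visibleSince === null) this.visibleSince = nowMs;
  }

  onHidden(nowMs) {
    if (this.visibleSince !== null) {
      this.activeMs += nowMs - this.visibleSince;
      this.visibleSince = null;
    }
  }

  onEvent(type) {
    if (type === "mousemove") this.mm += 1;
    else if (type === "scroll") this.sc += 1;
  }

  // Assumption: active time is reported in whole seconds.
  payload() {
    return { a: Math.round(this.activeMs / 1000), e: { mm: this.mm, sc: this.sc } };
  }
}

const tracker = new EngagementTracker();
tracker.onVisible(0);
tracker.onEvent("mousemove");
tracker.onEvent("mousemove");
tracker.onEvent("scroll");
tracker.onHidden(12000);
console.log(tracker.payload()); // { a: 12, e: { mm: 2, sc: 1 } }
```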
2. Deliver Server-Side HTML to Anonymous Requests
Brave’s Double Fetch requests your page without cookies or session tokens. If your site blocks anonymous rendering (e.g., hard cookie walls or strict paywalls), the structural signature mismatch can cause WDP to flag the page as private and discard it before it ever reaches the search index. Ensure your server delivers clean, indexable HTML to anonymous requests.
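One way to check your own pages is to fetch them without credentials and look at the same kind of signature. The script below is a self-check sketch, not a reproduction of WDP: `extractSignature` and `fetchAnonymously` are hypothetical helpers, the regex-based parsing is crude and will miss forms rendered by JavaScript, and the global `fetch` assumes Node 18 or later (or a browser).

```javascript
// Hypothetical self-check helpers; regex parsing is a rough approximation of a real DOM walk.
function extractSignature(html) {
  const count = (re) => (html.match(re) || []).length;
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return {
    nip: count(/<input\b[^>]*type\s*=\s*["']?password/gi), // password inputs
    nf: count(/<form\b/gi),                                // forms
    t: title ? title[1].trim() : "",                       // page title
  };
}

// Fetch the page the way an anonymous visitor would: no cookies, no stored credentials.
async function fetchAnonymously(url) {
  const res = await fetch(url, { credentials: "omit", redirect: "follow" });
  return extractSignature(await res.text());
}

const loginWall =
  '<html><head><title> Sign in </title></head>' +
  '<body><form><input type="password" name="pw"></form></body></html>';
console.log(extractSignature(loginWall)); // { nip: 1, nf: 1, t: 'Sign in' }
```

If `fetchAnonymously` on one of your article URLs returns a password input or a login title that a logged-in visitor would not see, that page is a likely candidate for being discarded.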
3. Implement Comprehensive Schema.org Markup
Brave’s AI search models and third-party AI agents rely heavily on structured data to parse facts. Having flawless JSON-LD Schema markup (e.g., Product, FAQPage, Article, LocalBusiness) is essential for AI tools to extract your content and credit you as the cited source.
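As an example of the markup in question, the helper below builds a JSON-LD script tag for the Schema.org Article type. The function name and all sample values are placeholders, and the property set is a small illustrative subset rather than a list of what any particular engine requires.

```javascript
// Hypothetical helper that renders Schema.org Article markup as a JSON-LD script tag.
function articleJsonLd({ headline, authorName, datePublished, url }) {
  const data = {
    "@context": "https://schema.org",
    "@type": "Article",
    headline,
    author: { "@type": "Person", name: authorName },
    datePublished,
    mainEntityOfPage: url,
  };
  // Escape "<" so a value containing "</script>" cannot close the tag early.
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}

// Placeholder values for illustration only.
const tag = articleJsonLd({
  headline: "Example headline",
  authorName: "Example Author",
  datePublished: "2025-01-01",
  url: "https://example.com/articles/example",
});
console.log(tag);
```

The output belongs in the server-rendered HTML, so that it is present in the anonymous response and not only after client-side scripts run.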
4. Build Authority in Discussions (Discussions Engine)
Brave Search has an integrated “Discussions” widget that displays relevant threads from Reddit and StackExchange directly on the search page. The algorithm weights these threads based on a “discussion-worthiness” score (freshness, popularity, and upvotes/replies). Actively participating and providing value in relevant forums can put your brand on Brave’s front page, even if your main domain doesn’t rank yet.
Summary: Good SEO Remains Good SEO
Ultimately, optimizing for Brave Search and AI agents doesn’t require manipulative SEO hacks. It reinforces the core pillars of modern web development:
- Solid Technical Foundation: Fast load times, clean HTML structure, and no blocking scripts.
- Structured Metadata: Clear Schema.org markup.
- Killer Content: High-quality, educational content that answers user intent and keeps readers engaged.
To get your latest pages in front of Brave Search sooner, you can also submit your URLs directly through the Brave Search URL Submission Tool.
Søren Riisager