
How to Build a Reliable Amazon Search Results Scraper
Purpose: Extract the title, price, and product page URL from Amazon search results using Witrium's no-code workflows, with the Python SDK orchestrating parallel runs at scale.
Legal note: Always review and comply with Amazon's Terms of Service and applicable laws. Use respectful rates, avoid abuse, and prefer official APIs when available.
What we're building
Let's say you need to scrape details from the top 500 Amazon search results for "vacuum cleaners", extracting titles, prices, and product URLs. A manual approach would require navigating through 32+ pages (Amazon typically shows 16 results per page, depending on locale and device), maintaining a long browser session, and keeping unnecessary context in memory, which degrades performance.
This tutorial shows you a smarter approach with Witrium: Don't keep one mega browser session alive to scrape across all 32 pages. Instead, define one single-page extraction workflow in Witrium, then run it across all the pages simultaneously by tweaking a URL parameter and parallelizing the runs.
Base search URL (page 1):
https://www.amazon.com/s?k=vacuum+cleaners&page=1
We'll change `k` (the search term) and `page` for subsequent runs to cover all pages, and even reuse this workflow for other keywords.
Traditional Approaches vs. Witrium
There are several approaches to web scraping, each with significant drawbacks. Here are the most common ones:
1. Traditional Automation Scripts (Selenium/Playwright)
Finely crafted scripts that use Selenium, Playwright, or Puppeteer to navigate the web and extract data.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.amazon.com/s?k=vacuum+cleaners")

# Wait for results to render before querying the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".s-result-item"))
)
products = driver.find_elements(By.CSS_SELECTOR, ".s-result-item")
# ... hundreds of lines of brittle selector logic
Major Drawbacks:
- Painstaking Development: Hours spent inspecting element IDs, CSS selectors, and XPath expressions
- Extremely Brittle: Scripts break whenever Amazon changes their HTML structure (which happens frequently)
- Debugging Nightmare: Cryptic failures when selectors stop working, requiring constant maintenance
- Infrastructure Overhead: Need to deploy and manage your own browser infrastructure
- Stealth Complexity: Manually implementing anti-detection measures, proxy rotation, and browser fingerprinting
- Limited Adaptability: Hard-coded selectors can't adapt to dynamic content or layout changes
2. DIY AI Agent Solutions
Building your own AI-powered web scraper using LLMs to interpret page content.
Overwhelming Complexity:
- Prompt Engineering: Months of iteration to craft prompts that work reliably across different page layouts
- Context Window Management: Manually chunking and optimizing content to fit model limitations
- Model Selection: Researching and testing different LLMs for accuracy, speed, and cost trade-offs
- Agent Deployment: Setting up infrastructure to run your AI agent at scale
- Browser Infrastructure: Still need to manage headless browsers, sessions, and stealth capabilities
- Cost Optimization: Balancing model performance against API costs per extraction
3. General-purpose AI Agents (Browser-Use, OpenAI Operator)
Using general-purpose AI browsing agents that can navigate websites autonomously.
Fundamental Limitations:
- Exploratory Nature: Designed for one-off tasks, not repeatable production workflows
- Lack of Robustness: Can get confused and veer off on tangents that never converge on the user's goal
- Inconsistent Results: May extract different data formats or miss elements across runs
- No Workflow Optimization: Can't leverage patterns or shortcuts for better performance
- Poor Scalability: Not built for high-volume, parallel processing scenarios
4. Managed Browser Services (Browserbase, Browserless)
Using cloud browser services to handle the infrastructure while you build the scraping logic.
Still Your Problem:
- Resource Management: You're responsible for scaling, session handling, and cost optimization
- Session Complexity: Managing authentication, cookies, and state across multiple requests
- Script Maintenance: All the brittleness of traditional scripts still applies
The Witrium Advantage
Witrium eliminates all these pain points by providing a purpose-built platform for intelligent web automation:
✅ No Code Required - Visual workflow builder instead of brittle scripts
✅ AI-Powered Adaptability - Automatically adapts to layout changes
✅ Fully Managed Infrastructure - Browser deployment, scaling, and stealth handled automatically
✅ Optimized for Repeatability - Built specifically for production workflows, not exploration
✅ Integrated Scaling - Parallel processing and resource management included
✅ Zero Maintenance - No agent engineering, prompt engineering, model selection, or infrastructure management
Simply define your workflow, and Witrium handles the rest.
What you'll learn:
- Building a no-code Amazon product extraction workflow
- Leveraging URL parameters for scalable scraping
- Orchestrating parallel workflows using Witrium's Python SDK
- Optimizing performance through smart pagination handling
Prerequisites
- A Witrium account and an API token (for orchestration), which you can find in the Witrium dashboard
- Basic Python environment (optional, for orchestration)
Step-by-Step Implementation
Below are the build steps in order, with developer-friendly details along the way.
1) Create the workflow
- Once you've created an account and logged in, go to https://witrium.com/workflows and click Create Workflow.
- Name it "Amazon Products" and click Create Workflow.
2) Set the Target URL
Set the workflow's Target URL strategically to maximize reusability:
https://www.amazon.com/s?k=vacuum+cleaners&page=1
Why this matters: On later runs, you can override the entire Target URL. That lets you:
- A) Swap the `k` parameter to run other search terms (e.g., `air+purifier`).
- B) Change the `page` number to paginate (1 → 32), with no need to navigate through pagination manually (see the sketch below).
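For example, here is a small Python sketch that builds these override URLs for any keyword and page number; the orchestration script later in this guide uses the same pattern:

from urllib.parse import quote_plus

def page_url(search_term: str, page: int) -> str:
    # Build the Amazon search URL for a given keyword and results page
    return f"https://www.amazon.com/s?k={quote_plus(search_term)}&page={page}"

print(page_url("vacuum cleaners", 1))  # https://www.amazon.com/s?k=vacuum+cleaners&page=1
print(page_url("air purifier", 5))     # https://www.amazon.com/s?k=air+purifier&page=5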
3) Start a Build Session
Click Start Build Session. Witrium spins up a cloud-based Chrome browser instance and opens your Target URL. You'll see a live stream of the browser on the right, so you can test each instruction as you add it.
4) Add a scroll instruction (Instruction #2 in the workflow)
Add the following instruction:
Scroll down by 350 pixels
Why: This scrolls past Amazon's header and ads at the top of the page and brings the LLM's focus to the actual product list. Witrium provides the page context to the agent from the current scroll position down to the bottom. If the content you want begins lower on the page, scroll to the start of that content first so the model isn't distracted by irrelevant material above it.
Click Play on this instruction to verify that the page scrolls by 350 pixels.
5) Add the data extraction instruction (JSON output)
Add a detailed extraction instruction (be explicit and consistent to keep results stable across runs):
Extract the following data for all products visible on the page and return as JSON:
1) `title`: the full product title (string).
2) `price`: the listed price as shown on the page (string). Do not round or normalize. Include currency symbol if present.
3) `url`: the product page URL (string). It must start with "https://www.amazon.com" or "https://amazon.com". Do not make it up; leave it as null if not available.
Return a top-level JSON object: { "items": [ ... ] }
Tip: You can also specify an explicit JSON schema to enforce structure. Keep instructions deterministic and avoid vague phrasing.
Click Play to run the extraction. Witrium will interpret the page and render the JSON results beneath the instruction. For dense pages, it may take a minute or two.
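For reference, the parsed result that the orchestration code below consumes should look roughly like the following Python dict (the values here are illustrative placeholders, not real data):

# Illustrative shape only -- every value below is a placeholder
expected_shape = {
    "items": [
        {
            "title": "Example Vacuum Cleaner, Bagless, Corded",
            "price": "$99.99",
            "url": "https://www.amazon.com/dp/XXXXXXXXXX",
        },
        # ... one entry per product visible on the page
    ]
}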
6) End the Build Session
Once the extracted data displayed below the instruction looks correct, click End Build Session. Your single-page extractor is complete.
7) Orchestrate 32 pages (REST or Python)
We'll run this workflow 32 times, changing the `page` parameter on each run. You can:
- Call the workflow's auto-generated REST endpoint, or
- Use the Witrium Python SDK to parallelize runs.
Witrium manages concurrent browsers and infrastructure; you simply collect results.
Option A: Trigger via REST (example)
Replace placeholders like `<WORKFLOW_ID>` and `<API_TOKEN>` with your values.
Kick off a run for page 1:
curl -X POST \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  "https://api.witrium.com/v1/workflows/<WORKFLOW_ID>/run" \
  -d '{
    "args": {
      "url": "https://www.amazon.com/s?k=vacuum+cleaners&page=1"
    }
  }'
Poll for results (run_id returned from the previous call):
curl -H "Authorization: Bearer <API_TOKEN>" "https://api.witrium.com/v1/runs/<RUN_ID>/results"
Repeat for pages 2..32.
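If you prefer to script these REST calls rather than run curl by hand, a minimal Python sketch with the requests library could look like the following. The `run_id` and `status` field names here are assumptions; check the actual response payloads (or the API docs) before relying on them.

import time
import requests

API_TOKEN = "<API_TOKEN>"
WORKFLOW_ID = "<WORKFLOW_ID>"
BASE_URL = "https://api.witrium.com/v1"
HEADERS = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
}

def run_page(page: int) -> dict:
    # Kick off a run for one results page
    resp = requests.post(
        f"{BASE_URL}/workflows/{WORKFLOW_ID}/run",
        headers=HEADERS,
        json={"args": {"url": f"https://www.amazon.com/s?k=vacuum+cleaners&page={page}"}},
        timeout=30,
    )
    resp.raise_for_status()
    run_id = resp.json()["run_id"]  # assumed field name; verify against your API response

    # Poll the results endpoint until the run is no longer in progress
    while True:
        results = requests.get(
            f"{BASE_URL}/runs/{run_id}/results", headers=HEADERS, timeout=30
        )
        results.raise_for_status()
        data = results.json()
        if data.get("status") not in ("pending", "running"):  # assumed status values
            return data
        time.sleep(5)

for page in range(1, 33):
    print(run_page(page))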
Option B: Orchestrate with Python (parallel)
Install the SDK:
pip install witrium
Example script:
import os
import csv
import math
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import quote_plus

from witrium.client import (
    SyncWitriumClient,
    WorkflowRunResultsSchema,
    WorkflowRunStatus,
)

API_TOKEN = "<YOUR_API_TOKEN>"
WORKFLOW_ID = "<YOUR_WORKFLOW_ID>"

SEARCH_TERM = "vacuum cleaners"
TARGET_ITEMS = 500
ITEMS_PER_PAGE = 16  # typical; adjust if your layout differs
TOTAL_PAGES = math.ceil(TARGET_ITEMS / ITEMS_PER_PAGE)
BROWSERS_PER_BATCH = 4
OUTPUT_CSV = "amazon_vacuum_cleaners.csv"


def save_items(items, path=OUTPUT_CSV):
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "url", "page"])
        if is_new:
            writer.writeheader()
        for item in items:
            row = {
                "title": item.get("title"),
                "price": item.get("price"),
                "url": item.get("url"),
                "page": item.get("page"),
            }
            writer.writerow(row)


def process_results(event: WorkflowRunResultsSchema, page: int) -> int:
    # Expecting the workflow to return { "items": [ ... ] }
    if event.status == WorkflowRunStatus.COMPLETED:
        payload = event.result or {}
        items = payload.get("items", [])
        # Annotate page for traceability
        for it in items:
            it["page"] = page
        if items:
            save_items(items)
        return len(items)
    else:
        # Handle failure
        print(f"Page {page} failed: {event.status}")
        return 0


# ----- Orchestration --------------------------------------------------------

def page_url(search_term: str, page: int) -> str:
    return f"https://www.amazon.com/s?k={quote_plus(search_term)}&page={page}"


def run_single_page(wit_client: SyncWitriumClient, page: int) -> int:
    """Runs one page and persists results. Returns item count for the page."""
    print(f"Running workflow for Page {page}...")
    # run_workflow_and_wait will run the workflow and wait till completion
    result = wit_client.run_workflow_and_wait(
        workflow_id=WORKFLOW_ID,
        args={"url": page_url(SEARCH_TERM, page)},
    )
    return process_results(result, page)


if __name__ == "__main__":
    totals = 0
    with SyncWitriumClient(api_token=API_TOKEN) as client:
        pages = list(range(1, TOTAL_PAGES + 1))
        # Run in parallel
        with ThreadPoolExecutor(max_workers=BROWSERS_PER_BATCH) as pool:
            futures = {pool.submit(run_single_page, client, p): p for p in pages}
            for fut in as_completed(futures):
                p = futures[fut]
                try:
                    count = fut.result()
                    totals += count
                    print(f"Page {p} → {count} items")
                except Exception as e:
                    print(f"Page {p} failed: {e}, retrying...")
                    # Optional: simple retry
                    time.sleep(1)
                    try:
                        count = run_single_page(client, p)
                        totals += count
                        print(f"Page {p} (retry) → {count} items")
                    except Exception as e2:
                        print(f"Page {p} failed again: {e2}")

    print(f"Done. Total items saved: {totals}. CSV: {OUTPUT_CSV}")
What this does:
- Calculates `TOTAL_PAGES = ceil(500 / 16) = 32`.
- Launches up to `BROWSERS_PER_BATCH` concurrent page runs.
- Saves `title`, `price`, `url`, and `page` to a CSV.
- Retries a failed page once (simple backoff shown).
Quality & Robustness Tips
- Be explicit in your extraction instruction: output structure, field definitions, URL constraints, and any ASIN parsing rule you need.
- Scroll to the content first to limit noise in the LLM's context.
- Batch politely: keep concurrency reasonable - below the limit of concurrent runs allowed by your plan.
- Region differences: Amazon's layout can vary by locale. Verify the extraction on page 1 before scaling.
- Data hygiene: product URLs may include query params; store the canonical path if needed (see the sketch below).
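For the data-hygiene tip above, a small helper along these lines can normalize stored URLs (a sketch; adjust the pattern if your product URLs take a different shape):

import re
from urllib.parse import urlparse

def canonical_product_url(url: str) -> str:
    # Reduce an Amazon product URL to its canonical /dp/<ASIN> form when possible
    if not url:
        return url
    parsed = urlparse(url)
    # ASINs are typically 10 uppercase alphanumeric characters after /dp/ or /gp/product/
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", parsed.path)
    if match:
        return f"https://www.amazon.com/dp/{match.group(1)}"
    # Otherwise just drop the query string
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"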
Reuse this for other searches
Just change:
- The `k` parameter (e.g., `k=air+purifier`).
- `TARGET_ITEMS` (and `ITEMS_PER_PAGE` if you observe a different per-page count).
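For example, a 300-item run for air purifiers only requires changing the constants at the top of the script:

SEARCH_TERM = "air purifier"  # quote_plus turns this into k=air+purifier
TARGET_ITEMS = 300
ITEMS_PER_PAGE = 16           # adjust if your locale shows a different count
# TOTAL_PAGES recomputes to ceil(300 / 16) = 19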
You now have a repeatable, scalable Amazon search scraper built with Witrium.
To learn about other features of Witrium, check out our documentation.
Got questions? Reach out to our support team at support@witrium.com.