How to Build a Reliable Amazon Search Results Scraper

August 22, 2025

Purpose: Extract the title, price, and product page URL from Amazon search results using Witrium's no-code workflows, with the Python SDK orchestrating parallel runs at scale.

Legal note: Always review and comply with Amazon's Terms of Service and applicable laws. Use respectful rates, avoid abuse, and prefer official APIs when available.

What we're building

Let's say you need to scrape details from the top 500 Amazon search results for "vacuum cleaners" - extracting titles, prices, and product URLs. A manual approach would mean navigating through 32+ pages (Amazon typically shows about 16 results per page, depending on locale and device), maintaining one long browser session, and carrying unnecessary context in memory the whole time, which degrades performance.

This tutorial shows you a smarter approach with Witrium: instead of keeping one mega browser session alive to scrape all 32 pages, define a single-page extraction workflow in Witrium, then run it across all the pages simultaneously by tweaking a URL parameter and parallelizing the runs.

  • Base search URL (page 1):

    https://www.amazon.com/s?k=vacuum+cleaners&page=1

  • We'll change k (search term) and page for subsequent runs to cover all pages, and even reuse this workflow for other keywords.
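
To make the parameterization concrete, here's a minimal Python sketch (standard library only) that builds the per-page URLs we'll feed to the workflow later; the 16-results-per-page figure is typical but can vary by locale and device.

from urllib.parse import quote_plus

def search_url(term: str, page: int) -> str:
    # URL-encode the search term and append the page number
    return f"https://www.amazon.com/s?k={quote_plus(term)}&page={page}"

# ~32 pages cover roughly 500 results at ~16 results per page
urls = [search_url("vacuum cleaners", page) for page in range(1, 33)]
print(urls[0])   # https://www.amazon.com/s?k=vacuum+cleaners&page=1
print(urls[-1])  # https://www.amazon.com/s?k=vacuum+cleaners&page=32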

Traditional Approaches vs. Witrium

There are several approaches to web scraping, each with significant drawbacks. Here are the most common ones:

1. Traditional Automation Scripts (Selenium/Playwright)

Finely crafted scripts that use Selenium, Playwright, or Puppeteer to navigate the web and extract data.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.amazon.com/s?k=vacuum+cleaners")
# Wait for the results to render before querying the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".s-result-item"))
)
products = driver.find_elements(By.CSS_SELECTOR, ".s-result-item")
# ... hundreds of lines of brittle selector logic

Major Drawbacks:

  • Painstaking Development: Hours spent inspecting element IDs, CSS selectors, and XPath expressions
  • Extremely Brittle: Scripts break whenever Amazon changes their HTML structure (which happens frequently)
  • Debugging Nightmare: Cryptic failures when selectors stop working, requiring constant maintenance
  • Infrastructure Overhead: Need to deploy and manage your own browser infrastructure
  • Stealth Complexity: Manually implementing anti-detection measures, proxy rotation, and browser fingerprinting
  • Limited Adaptability: Hard-coded selectors can't adapt to dynamic content or layout changes

2. DIY AI Agent Solutions

Building your own AI-powered web scraper using LLMs to interpret page content.

Overwhelming Complexity:

  • Prompt Engineering: Months of iteration to craft prompts that work reliably across different page layouts
  • Context Window Management: Manually chunking and optimizing content to fit model limitations
  • Model Selection: Researching and testing different LLMs for accuracy, speed, and cost trade-offs
  • Agent Deployment: Setting up infrastructure to run your AI agent at scale
  • Browser Infrastructure: Still need to manage headless browsers, sessions, and stealth capabilities
  • Cost Optimization: Balancing model performance against API costs per extraction

3. General-purpose AI Agents (Browser-Use, OpenAI Operator)

Using general-purpose AI browsing agents that can navigate websites autonomously.

Fundamental Limitations:

  • Exploratory Nature: Designed for one-off tasks, not repeatable production workflows
  • Lack of Robustness: Can get confused and veer off on tangents that never converge on the user's goals
  • Inconsistent Results: May extract different data formats or miss elements across runs
  • No Workflow Optimization: Can't leverage patterns or shortcuts for better performance
  • Poor Scalability: Not built for high-volume, parallel processing scenarios

4. Managed Browser Services (Browserbase, Browserless)

Using cloud browser services to handle the infrastructure while you build the scraping logic.

Still Your Problem:

  • Resource Management: You're responsible for scaling, session handling, and cost optimization
  • Session Complexity: Managing authentication, cookies, and state across multiple requests
  • Script Maintenance: All the brittleness of traditional scripts still applies

The Witrium Advantage

Witrium eliminates all these pain points by providing a purpose-built platform for intelligent web automation:

  • No Code Required - Visual workflow builder instead of brittle scripts
  • AI-Powered Adaptability - Automatically adapts to layout changes
  • Fully Managed Infrastructure - Browser deployment, scaling, and stealth handled automatically
  • Optimized for Repeatability - Built specifically for production workflows, not exploration
  • Integrated Scaling - Parallel processing and resource management included
  • Zero Maintenance - No agent engineering, prompt engineering, model selection, or infrastructure management

Simply define your workflow, and Witrium handles the rest.

What you'll learn:

  • Building a no-code Amazon product extraction workflow
  • Leveraging URL parameters for scalable scraping
  • Orchestrating parallel workflows using Witrium's Python SDK
  • Optimizing performance through smart pagination handling

Prerequisites

  • A Witrium account and an API token (for orchestration), which you can find in the Witrium dashboard
  • Basic Python environment (optional, for orchestration)

Step-by-Step Implementation

The build steps below are presented in order, with developer-friendly details along the way.

1) Create the workflow

  • Once you've created an account and logged in, go to https://witrium.com/workflows and click Create Workflow.
  • Name it "Amazon Products" and click Create Workflow.

Creating a workflow

2) Set the Target URL

Set the workflow's Target URL strategically to maximize reusability:

https://www.amazon.com/s?k=vacuum+cleaners&page=1

Why this matters: On later runs, you can override the entire Target URL. That lets you:

  • A) swap the k parameter to run other search terms (e.g., air+purifier).
  • B) change the page number to paginate (1 → 32). No need to navigate through pagination manually.

Setting the target URL

3) Start a Build Session

Click Start Build Session. Witrium spins up a cloud-based Chrome browser instance and opens your Target URL. You'll see a live stream of the browser on the right, so you can test each instruction as you add it.

Starting a build session

4) Add a scroll instruction (Instruction #2 in the workflow)

Add the following instruction:

Scroll down by 350 pixels

Why: This scrolls past Amazon's header and the ads that usually sit at the top of the page, bringing the LLM's focus to the actual product list. Witrium provides page context to the agent from the current scroll position down to the bottom of the page, so if the content you want begins lower on the page, scroll to the start of that content first so the model isn't distracted by irrelevant material above it.

Click Play on this instruction to verify that the page scrolls down by 350 pixels.

Playing the scroll instruction

5) Add the data extraction instruction (JSON output)

Add a detailed extraction instruction (be explicit and consistent so results stay stable across runs):

Extract the following data for all products visible on the page and return as JSON:

1) `title`: the full product title (string).
2) `price`: the listed price as shown on the page (string). Do not round or normalize. Include currency symbol if present.
3) `url`: the product page URL (string). It must start with "https://www.amazon.com" or "https://amazon.com". Do not make this up; if it is not available, leave it as null.

Return a top-level JSON object: { "items": [ ... ] }

Tip: You can also specify an explicit JSON schema to enforce structure. Keep instructions deterministic and avoid vague phrasing.
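
For orientation, each extracted item should map onto the fields the orchestration script below writes to CSV. Expressed as a Python dict, the expected shape looks roughly like this (every value is a purely illustrative placeholder, not real data):

# Illustrative shape only - every value below is a placeholder
expected_output = {
    "items": [
        {
            "title": "Example 1200W Bagless Upright Vacuum Cleaner",
            "price": "$129.99",
            "url": "https://www.amazon.com/dp/XXXXXXXXXX",
        },
        # ... one entry per product visible on the page
    ]
}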

Click Play to run the extraction. Witrium will interpret the page and render the JSON results beneath the instruction. For dense pages, it may take a minute or two.

Playing the extraction instruction

6) End the Build Session

Once the extracted data displayed below the instruction looks correct, click End Build Session. Your single-page extractor is complete.

Ending the build session

7) Orchestrate 32 pages (REST or Python)

We'll run this workflow 32 times, changing the page parameter on each run. You can:

  • Call the workflow's auto-generated REST endpoint, or
  • Use the Witrium Python SDK to parallelize runs.

Witrium manages concurrent browsers and infrastructure; you simply collect results.

Option A: Trigger via REST (example)

Replace placeholders like <WORKFLOW_ID> and <API_TOKEN> with your values.

Kick off a run for page 1:

curl -X POST \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  "https://api.witrium.com/v1/workflows/<WORKFLOW_ID>/run" \
  -d '{
    "args": {
      "url": "https://www.amazon.com/s?k=vacuum+cleaners&page=1"
    }
  }'

Poll for results (run_id returned from the previous call):

curl -H "Authorization: Bearer <API_TOKEN>" \
  "https://api.witrium.com/v1/runs/<RUN_ID>/results"

Repeat for pages 2..32.
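
If you'd rather stay on the REST route but not run curl 32 times by hand, a short Python loop can kick off every page. This is a sketch using the third-party requests library; it assumes the kickoff response is JSON containing the run_id mentioned above.

import requests

API_TOKEN = "<API_TOKEN>"
WORKFLOW_ID = "<WORKFLOW_ID>"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}", "Content-Type": "application/json"}

run_ids = []
for page in range(1, 33):
    resp = requests.post(
        f"https://api.witrium.com/v1/workflows/{WORKFLOW_ID}/run",
        headers=HEADERS,
        json={"args": {"url": f"https://www.amazon.com/s?k=vacuum+cleaners&page={page}"}},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes the response body exposes the run ID as "run_id"
    run_ids.append(resp.json().get("run_id"))

# Then poll https://api.witrium.com/v1/runs/<RUN_ID>/results for each collected run ID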

Option B: Orchestrate with Python (parallel)

Install the SDK:

pip install witrium

Example script:

import os
import csv
import math
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import quote_plus
from witrium.client import (
    SyncWitriumClient,
    WorkflowRunResultsSchema,
    WorkflowRunStatus,
)


API_TOKEN = "<YOUR_API_TOKEN>"
WORKFLOW_ID = "<YOUR_WORKFLOW_ID>"
SEARCH_TERM = "vacuum cleaners"
TARGET_ITEMS = 500
ITEMS_PER_PAGE = 16  # typical; adjust if your layout differs
TOTAL_PAGES = math.ceil(TARGET_ITEMS / ITEMS_PER_PAGE)
BROWSERS_PER_BATCH = 4
OUTPUT_CSV = "amazon_vacuum_cleaners.csv"


def save_items(items, path=OUTPUT_CSV):
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "url", "page"])
        if is_new:
            writer.writeheader()
        for item in items:
            row = {
                "title": item.get("title"),
                "price": item.get("price"),
                "url": item.get("url"),
                "page": item.get("page"),
            }
            writer.writerow(row)


def process_results(event: WorkflowRunResultsSchema, page: int) -> int:
    # Expecting the workflow to return { "items": [ ... ] }
    if event.status == WorkflowRunStatus.COMPLETED:
        payload = event.result or {}
        items = payload.get("items", [])

        # Annotate page for traceability
        for it in items:
            it["page"] = page

        if items:
            save_items(items)
        return len(items)
    else:
        # Handle failure
        print(f"Page {page} failed: {event.status}")
    return 0


# ----- Orchestration --------------------------------------------------------


def page_url(search_term: str, page: int) -> str:
    return f"https://www.amazon.com/s?k={quote_plus(search_term)}&page={page}"


def run_single_page(wit_client: SyncWitriumClient, page: int) -> int:
    """Runs one page and persists results. Returns item count for the page."""
    print(f"Running workflow for Page {page}...")
    # run_workflow_and_wait will run the workflow and wait till completion
    result = wit_client.run_workflow_and_wait(
        workflow_id=WORKFLOW_ID,
        args={"url": page_url(SEARCH_TERM, page)},
    )

    return process_results(result, page)


if __name__ == "__main__":
    with SyncWitriumClient(api_token=API_TOKEN) as client:
        pages = list(range(1, TOTAL_PAGES + 1))

        # Run in parallel
        totals = 0
        with ThreadPoolExecutor(max_workers=BROWSERS_PER_BATCH) as pool:
            futures = {pool.submit(run_single_page, client, p): p for p in pages}
            for fut in as_completed(futures):
                p = futures[fut]
                try:
                    count = fut.result()
                    totals += count
                    print(f"Page {p} → {count} items")
                except Exception as e:
                    print(f"Page {p} failed: {e}, retrying...")
                    # Optional: simple retry
                    time.sleep(1)
                    try:
                        count = run_single_page(client, p)
                        totals += count
                        print(f"Page {p} (retry) → {count} items")
                    except Exception as e2:
                        print(f"Page {p} failed again: {e2}")

        print(f"Done. Total items saved: {totals}. CSV: {OUTPUT_CSV}")

What this does:

  • Calculates TOTAL_PAGES = ceil(500 / 16) = 32.
  • Launches up to BROWSERS_PER_BATCH concurrent page runs.
  • Saves title, price, url, page to a CSV.
  • Retries a failed page once (simple backoff shown).
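
If you want more resilience than the single retry above, a small wrapper around run_single_page with exponential backoff is one option. This is a sketch; the attempt count and delays are arbitrary, so tune them to your plan's limits.

import random
import time

def run_with_backoff(client, page, attempts=3, base_delay=2.0):
    """Retry a page a few times, doubling the delay (plus jitter) on each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return run_single_page(client, page)
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Page {page} attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)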

Quality & Robustness Tips

  • Be explicit in your extraction instruction: structure, fields, URL constraints, and any ASIN-parsing rules you need.
  • Scroll to the content first to limit noise in the LLM's context.
  • Batch politely: keep concurrency reasonable - below the limit of concurrent runs allowed by your plan.
  • Region differences: Amazon's layout can vary by locale. Verify the extraction on page 1 before scaling.
  • Data hygiene: product URLs may include query params - store the canonical path if needed (see the sketch below).
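
For that last tip, here's a minimal sketch of URL clean-up using only the standard library: it keeps the scheme, host, and path and drops query strings and fragments (the example URL uses a placeholder ASIN).

from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Strip query parameters and fragments, keeping scheme://host/path."""
    if not url:
        return None
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(canonical_url("https://www.amazon.com/dp/XXXXXXXXXX?ref=sr_1_3&keywords=vacuum"))
# -> https://www.amazon.com/dp/XXXXXXXXXX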

Reuse this for other searches

Just change:

  • The k parameter (e.g., k=air+purifier).
  • The TARGET_ITEMS (and ITEMS_PER_PAGE if you observe a different per-page count).
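
For example, assuming the helpers from the orchestration script above (page_url, process_results, and the client setup) are in scope, a quick loop over a few keywords can verify page 1 of each search before you scale out. The keyword list here is illustrative.

# Reuse the same workflow for other searches by overriding the Target URL per run
keywords = ["air purifier", "robot vacuum", "cordless stick vacuum"]  # illustrative terms

with SyncWitriumClient(api_token=API_TOKEN) as client:
    for term in keywords:
        first_page = client.run_workflow_and_wait(
            workflow_id=WORKFLOW_ID,
            args={"url": page_url(term, 1)},  # check page 1 before running all pages
        )
        print(term, process_results(first_page, page=1))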

You now have a repeatable, scalable Amazon search scraper built with Witrium.

To learn about other features of Witrium, check out our documentation.

Got questions? Reach out to our support team at support@witrium.com.