
How to Scrape Amazon Product Listings (Python Guide)


The Answer

Scraping an Amazon product listing means fetching amazon.com/dp/<ASIN> and parsing anywhere from 11 core fields to 55 total fields out of the product detail page HTML: ASIN, title, Buy Box price, list price, currency, star rating, review count, availability, brand, bullet points, full description, image URLs, variations, Buy Box seller, and category breadcrumbs. The durable Python stack pairs curl_cffi (for Chrome TLS fingerprint impersonation) and residential proxies at the request layer with BeautifulSoup or lxml at the parsing layer. Stable CSS selectors are #productTitle, .a-price .a-offscreen for prices, #acrPopover for star rating, #acrCustomerReviewText for review count, #feature-bullets for bullet points, and the data-old-hires attribute on the main image. Expect 85 to 95% first-attempt success rates with this stack. For production workloads above a few thousand ASINs per month, Amazon Scraper API wraps all of this in a per-success-billed endpoint starting at $0.90 per 1,000 requests, with 1,000 free requests on signup.

What Is an Amazon Product Listing?

An Amazon product listing is a single product detail page, identified by a 10-character ASIN (Amazon Standard Identification Number) and reached at amazon.com/dp/<ASIN> or the longer SEO-friendly URL amazon.com/<slug>/dp/<ASIN>/ref=.... Each listing contains the full set of merchandising fields a buyer needs to make a purchase decision: who sells the product, what it costs, what it looks like, what other buyers thought of it, what variations exist, and how soon it can ship.
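The short-form URL is trivial to construct once you have the ASIN. A minimal sketch (the ASIN shown is just an example value):

```python
# Hypothetical helper: build the canonical short-form product URL for an ASIN.
# The longer SEO-slug form resolves to the same page.
def product_url(asin: str, domain: str = "com") -> str:
    return f"https://www.amazon.{domain}/dp/{asin}"
```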

A typical listing page renders 50KB to 500KB of HTML depending on category. A Kindle book page is small. A multi-variant fashion product with 30 size and color combinations and 20 customer-uploaded images can hit half a megabyte.

The same page is used by Amazon’s own product detail rendering and by every search result that links to it, which is why the URL structure is stable across all 20 Amazon marketplaces (amazon.com, amazon.co.uk, amazon.de, amazon.co.jp, etc.).

Why Scrape Amazon Product Listings?

The five most common commercial use cases for product listing scraping:

  • Price monitoring. Tracking Buy Box prices, list prices, and discount frequency across competitor catalogs to inform repricing decisions.
  • MAP compliance. Brand owners checking that authorized resellers honor the brand’s Minimum Advertised Price.
  • Inventory tracking. Watching availability status (in stock, only N left, temporarily out of stock) across a catalog to detect supply chain shifts.
  • Catalog enrichment. Pulling product titles, descriptions, images, and bullets to build product databases for downstream tools (price comparison sites, affiliate engines, ML training).
  • Review monitoring. Aggregating star ratings and review counts to track product reception over time. Full review text scraping is a separate workflow (see our Amazon reviews guide).

Each use case has different fields it cares about. Price monitoring needs price plus seller plus availability. MAP compliance needs price plus seller plus URL for evidence. Inventory needs availability only. Catalog enrichment needs the full field set. Review monitoring needs only the aggregate fields.
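As a sketch, the field subsets above can be encoded as a simple lookup. Field names here are assumptions chosen to match the Product dataclass used later in this guide:

```python
# Illustrative mapping of use case -> minimal field subset (names assumed).
FIELDS_BY_USE_CASE = {
    "price_monitoring": {"price", "seller", "availability"},
    "map_compliance": {"price", "seller", "url"},
    "inventory_tracking": {"availability"},
    "review_monitoring": {"rating", "reviews_count"},
}

def fields_needed(use_case: str) -> set[str]:
    # Catalog enrichment (or any unlisted case) falls back to the full set.
    return FIELDS_BY_USE_CASE.get(use_case, {"*"})
```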

What Data Can You Extract From an Amazon Product Listing?

You can extract eleven core fields from any public Amazon product listing without logging in: title, ASIN, Buy Box price, list price (struck-through), currency, star rating, review count, availability, brand, image URL, and the first 8 to 10 featured reviews. A full extraction pulls 50+ additional fields: bullet points, full description, category ladder, best-seller rank, dimensions, weight, manufacturer, model number, date first available, variations (size, color, material), Buy Box seller name, Prime eligibility, Amazon’s Choice badge, frequency of price changes, and the histogram of star ratings.

The base 11-field extraction is what most pricing and MAP tools need. The full 50+ field extraction is what catalog enrichment and product database workflows need.

Two fields have recently become harder to get. The full review list moved behind a login wall on November 5, 2024, so only the featured reviews rendered on the detail page itself (the 8 to 10 mentioned above) remain publicly scrapable per ASIN. Per-user pricing variation also became more visible in 2025: Amazon now sometimes shows different prices to different sessions based on cookies, location, and Prime membership.

How Do You Set Up the Python Stack for Amazon Scraping?

You set up the Python stack with three packages: curl_cffi for browser-fingerprint TLS impersonation, beautifulsoup4 plus lxml for HTML parsing, and a residential proxy provider for the IP layer. The base setup:

pip install curl_cffi beautifulsoup4 lxml

The minimum viable scraper:

from curl_cffi import requests
from bs4 import BeautifulSoup

PROXY_URL = "http://user:[email protected]:8080"  # placeholder: your provider's gateway

ROBOT_MARKERS = (
    "captchacharacters",
    "Enter the characters you see below",
    "Robot Check",
)

class AmazonBlocked(RuntimeError):
    pass

def fetch_product(asin: str, domain: str = "com") -> str:
    url = f"https://www.amazon.{domain}/dp/{asin}"
    resp = requests.get(
        url,
        impersonate="chrome",
        proxies={"http": PROXY_URL, "https": PROXY_URL},
        timeout=30,
    )
    resp.raise_for_status()
    if any(marker in resp.text for marker in ROBOT_MARKERS):
        raise AmazonBlocked(f"Robot check on {url}")
    return resp.text

Three things in the request matter:

  • impersonate="chrome" makes the request look like a real Chrome session at the TLS layer, including the cipher-suite ordering and ALPN extensions Amazon’s WAF inspects in the first packet.
  • The 30-second timeout matches Amazon’s typical p95 response time under load. Tighter timeouts turn normal slow responses during traffic spikes into spurious failures.
  • The robot-check guard catches the three HTML signatures Amazon serves with status 200 when its anti-bot fires. Without this guard, your scraper silently writes empty or garbage data.
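The robot-check guard is worth testing in isolation because block pages arrive with status 200. An offline sketch of the same marker scan used in fetch_product, run against canned HTML (no network, no proxy):

```python
# Offline sketch of the robot-check guard from fetch_product above.
ROBOT_MARKERS = (
    "captchacharacters",
    "Enter the characters you see below",
    "Robot Check",
)

def looks_blocked(html: str) -> bool:
    # Any one marker in the body means Amazon served a block page with HTTP 200.
    return any(marker in html for marker in ROBOT_MARKERS)

block_page = "<html><title>Robot Check</title><body>...</body></html>"
product_page = '<html><span id="productTitle">Example Widget</span></html>'
```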

For setup details on the residential proxy provider, see our best proxies for Amazon scraping post.

What CSS Selectors Pull Amazon Listing Data?

The selectors that work as of April 2026, ranked by stability and grouped by field:

Title and ASIN:

  • Title: #productTitle
  • ASIN: extract from URL path /dp/<ASIN>/ or from <input id="ASIN"> value attribute

Price (Buy Box):

  • Primary: .a-price[data-a-color="base"] .a-offscreen (the accessibility-text version, most stable)
  • Fallback: #corePrice_feature_div .a-offscreen
  • Whole + fraction split: .a-price-whole + .a-price-fraction (less stable, breaks every 4 to 8 weeks)

List price (strikethrough):

  • .a-price[data-a-color="secondary"] .a-offscreen
  • Fallback: [data-a-strike="true"] .a-offscreen

Star rating:

  • Hidden text: #acrPopover .a-icon-alt (returns “4.7 out of 5 stars”)
  • Visible badge: .a-icon-star

Review count:

  • #acrCustomerReviewText (returns “12,450 ratings”)

Availability:

  • #availability span (returns “In Stock” or “Only 3 left in stock - order soon”)

Brand:

  • #bylineInfo (the “by Brand” link under the title)
  • Fallback: the href attribute of the <a id="bylineInfo"> link, which carries the brand slug

Bullet points:

  • #feature-bullets ul li:not(.aok-hidden) span.a-list-item

Hero image:

  • #landingImage (the data-old-hires attribute carries the highest-resolution URL)
  • Image gallery JSON: extract from inline <script> containing colorImages block

Buy Box seller:

  • #sellerProfileTriggerId (the seller-name link inside the Buy Box)
  • Fallback: #tabular-buybox table row

Category breadcrumb:

  • #wayfinding-breadcrumbs_feature_div ul li

The accessibility-text selectors (.a-offscreen) are the most durable. They sit behind Amazon’s screen-reader contract, which changes far less often than the visible layout.

How Do You Parse a Product Listing With BeautifulSoup?

The full extraction function with type hints:

import re
from dataclasses import dataclass, field
from typing import Optional
from bs4 import BeautifulSoup

CURRENCY_MAP = {"$": "USD", "£": "GBP", "€": "EUR", "¥": "JPY", "₹": "INR"}

@dataclass
class Product:
    asin: str
    title: Optional[str] = None
    brand: Optional[str] = None
    price: Optional[float] = None
    list_price: Optional[float] = None
    currency: Optional[str] = None
    rating: Optional[float] = None
    reviews_count: Optional[int] = None
    availability: Optional[str] = None
    bullets: list[str] = field(default_factory=list)
    image_url: Optional[str] = None
    seller: Optional[str] = None
    category: list[str] = field(default_factory=list)

def parse_product(html: str, asin: str) -> Product:
    soup = BeautifulSoup(html, "lxml")
    return Product(
        asin=asin,
        title=_text(soup.select_one("#productTitle")),
        brand=_brand(soup),
        price=_parse_price(soup.select_one('.a-price[data-a-color="base"] .a-offscreen')),
        list_price=_parse_price(soup.select_one('.a-price[data-a-color="secondary"] .a-offscreen')),
        currency=_currency(soup),
        rating=_parse_rating(soup),
        reviews_count=_parse_reviews(soup),
        availability=_text(soup.select_one("#availability span")),
        bullets=_bullets(soup),
        image_url=_hero_image(soup),
        seller=_text(soup.select_one("#sellerProfileTriggerId")),
        category=_category(soup),
    )

def _text(el) -> Optional[str]:
    return el.get_text(strip=True) if el else None

def _parse_price(el) -> Optional[float]:
    if not el:
        return None
    m = re.search(r"([0-9][0-9,]*\.?[0-9]*)", el.get_text())
    return float(m.group(1).replace(",", "")) if m else None

def _currency(soup) -> Optional[str]:
    sym_el = soup.select_one('.a-price[data-a-color="base"] .a-price-symbol')
    if sym_el:
        return CURRENCY_MAP.get(sym_el.get_text(strip=True))
    offscreen = soup.select_one('.a-price[data-a-color="base"] .a-offscreen')
    if not offscreen:
        return None
    text = offscreen.get_text()
    for sym, code in CURRENCY_MAP.items():
        if sym in text:
            return code
    return None

def _parse_rating(soup) -> Optional[float]:
    el = soup.select_one("#acrPopover .a-icon-alt") or soup.select_one(".a-icon-star .a-icon-alt")
    if not el:
        return None
    m = re.match(r"([0-9.]+) out of", el.get_text())
    return float(m.group(1)) if m else None

def _parse_reviews(soup) -> Optional[int]:
    el = soup.select_one("#acrCustomerReviewText")
    if not el:
        return None
    m = re.search(r"([0-9][0-9,]*)", el.get_text())
    return int(m.group(1).replace(",", "")) if m else None

def _bullets(soup) -> list[str]:
    return [
        li.get_text(strip=True)
        for li in soup.select("#feature-bullets ul li:not(.aok-hidden) span.a-list-item")
    ]

def _brand(soup) -> Optional[str]:
    el = soup.select_one("#bylineInfo")
    if not el:
        return None
    text = re.sub(r"^(Visit the |Brand: |by )", "", el.get_text(strip=True)).strip()
    # Strip the " Store" suffix; str.rstrip would treat it as a character set
    return re.sub(r"\s+Store$", "", text)

def _hero_image(soup) -> Optional[str]:
    el = soup.select_one("#landingImage")
    if not el:
        return None
    return el.get("data-old-hires") or el.get("src")

def _category(soup) -> list[str]:
    return [
        a.get_text(strip=True)
        for a in soup.select("#wayfinding-breadcrumbs_feature_div ul li a")
    ]

The _brand cleanup function strips Amazon’s UI prefixes (“Visit the X Store”, “Brand: X”, “by X”) to leave just the brand name. The _category function returns the breadcrumb as an ordered list, root category first.

How Do You Handle Variations?

Variations (size, color, material, capacity) live in a separate JSON block embedded in the page HTML. The parent ASIN renders the page; the variation ASINs hang off it. The block is inside an inline <script> tag that builds the variation picker on the client side.

import json

def parse_variations(html: str) -> dict:
    """Returns {variation_label: child_asin} for every variation on the page."""
    soup = BeautifulSoup(html, "lxml")
    twister_data = soup.find("script", string=re.compile("twister_data"))
    if not twister_data:
        return {}
    # DOTALL + greedy: the JSON spans multiple lines and nests braces
    m = re.search(r"twister_data\s*=\s*(\{.*\});", twister_data.string, re.DOTALL)
    if not m:
        return {}
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return {}
    return {
        " / ".join(combo): asin
        for asin, combo in data.get("dimensionToAsinMap", {}).items()
    }

The JSON shape varies slightly across categories (electronics use dimensionToAsinMap, fashion uses colorToAsinMap). For full coverage, parse all known variation maps and merge.
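A sketch of that merge, under the assumption that each map keys child ASINs to their dimension values as in parse_variations above (map names and the sample payload are illustrative):

```python
import json

# Probe each known map name and merge the results into one {label: asin} dict.
CANDIDATE_MAPS = ("dimensionToAsinMap", "colorToAsinMap")

def merge_variation_maps(data: dict) -> dict[str, str]:
    merged: dict[str, str] = {}
    for name in CANDIDATE_MAPS:
        for asin, combo in data.get(name, {}).items():
            # Combos may be a list of dimension values or a single string
            label = " / ".join(combo) if isinstance(combo, list) else str(combo)
            merged[label] = asin
    return merged

twister = json.loads(
    '{"dimensionToAsinMap": {"B000000001": ["Black", "Large"]},'
    ' "colorToAsinMap": {"B000000002": "Red"}}'
)
```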

How Do You Scrape Multiple Listings in Bulk?

The bulk pattern combines a per-ASIN extractor with a controlled-concurrency runner. Python’s concurrent.futures.ThreadPoolExecutor is the simplest fit:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def scrape_one(asin: str) -> Optional[Product]:
    try:
        html = fetch_product(asin)
        return parse_product(html, asin)
    except AmazonBlocked:
        time.sleep(5)
        try:
            html = fetch_product(asin)
            return parse_product(html, asin)
        except AmazonBlocked:
            return None
    except Exception as exc:
        print(f"failed {asin}: {exc}")
        return None

def scrape_bulk(asins: list[str], workers: int = 8) -> list[Product]:
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(scrape_one, a): a for a in asins}
        for fut in as_completed(futures):
            r = fut.result()
            if r:
                results.append(r)
    return results

Three rules for bulk:

  • Cap concurrency at your residential proxy pool size divided by 5. With 50 IPs in the rotation, 10 concurrent workers leave headroom for retries.
  • Always implement single-retry logic on robot-check failures. The first attempt may hit a flagged IP; the rotation gives you a fresh one on retry.
  • Log every failure with the ASIN. Aggregate failure rate is the early-warning signal that Amazon has rotated its anti-bot stack and the scraper needs an update.
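The concurrency-cap rule of thumb above, expressed as a helper (the divisor of 5 is the heuristic from the first bullet, not a hard limit):

```python
# Pool-size / 5 rule of thumb, floored at one worker.
def max_workers(proxy_pool_size: int, divisor: int = 5) -> int:
    return max(1, proxy_pool_size // divisor)
```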

How Do You Avoid Getting Blocked?

Five practices that materially raise the success rate of any Amazon listing scraper:

  • Country-matched proxy. A US IP scraping amazon.de triggers Amazon’s geo-mismatch detection in milliseconds. Match the proxy country to the TLD.
  • Rotate IPs per request. Amazon’s per-IP rate limit cuts in around 30 requests per minute. Rotating per request means no single IP ever crosses the threshold.
  • Realistic User-Agent. A User-Agent: python-requests/2.x header is an immediate flag. A current Chrome or Safari User-Agent is required; curl_cffi's impersonate mode sets one for you.
  • Browser-impersonation TLS via curl_cffi. Standard Python requests produces a TLS fingerprint Amazon flags in the first packet regardless of how good the rest of the code is.
  • Aggressive caching. A product page changes maybe once per day for most categories. Caching for 30 to 60 minutes cuts your bandwidth, your cost, and your detection risk by the same factor.
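The caching rule above can be as simple as an in-process dict keyed by ASIN. A minimal TTL-cache sketch (cached_fetch is a hypothetical wrapper; fetch stands in for fetch_product):

```python
import time

# Minimal in-process TTL cache for the 30-60 minute caching rule above.
_cache: dict[str, tuple[float, str]] = {}

def cached_fetch(asin: str, fetch, ttl: float = 1800.0) -> str:
    now = time.monotonic()
    hit = _cache.get(asin)
    if hit and now - hit[0] < ttl:
        return hit[1]  # fresh enough: no request, no bandwidth, no block risk
    html = fetch(asin)
    _cache[asin] = (now, html)
    return html
```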

For a deeper treatment, see our bypass Amazon CAPTCHA guide.

What’s the Difference Between a DIY Scraper and a Managed API?

Three differences:

  • Maintenance. A DIY scraper needs CSS-selector updates every 4 to 8 weeks when Amazon ships a layout change. A managed API absorbs this on the vendor side.
  • Anti-bot orchestration. A DIY scraper needs proxy rotation, TLS fingerprinting, and retry logic that you maintain. A managed API delivers a structured JSON response with all of that handled internally.
  • Cost predictability. A DIY scraper costs proxy bandwidth plus engineering time (estimated at 4 to 8 hours per month for a working production scraper, plus incident response when Amazon rotates its stack). A managed API is a flat per-success rate.

For workloads above 50,000 ASINs per month, the managed API path is cheaper on total cost of ownership. Amazon Scraper API returns the same field set the parser above produces, with the proxy + retry + parsing layers handled internally, at $0.50 to $0.90 per 1,000 successful requests. For ad-hoc, low-volume scraping (under 5,000 ASINs per month), the DIY path is competitive on cost.

FAQ

What’s the most important field to extract from an Amazon listing?

It depends on the use case. For pricing tools, it’s the Buy Box price plus Buy Box seller plus availability. For MAP compliance, the same three plus the URL as evidence. For catalog enrichment, the full 50+ field set. The base 11-field extraction (title, ASIN, price, list price, currency, rating, review count, availability, brand, image, seller) covers 80% of commercial use cases.

How do you extract the ASIN from an Amazon URL?

ASINs sit in the URL path immediately after /dp/ or /gp/product/. A regex like /(?:dp|gp/product)/([A-Z0-9]{10}) matches both public URL formats. ASINs are always 10 characters long: uppercase letters and digits.

How do you scrape Amazon product images?

The hero image URL is in the data-old-hires attribute on #landingImage. The full image gallery is inside an inline <script> tag that contains a colorImages JSON block. Parse the script content with regex to extract all variant image URLs in their full resolution. Amazon’s CDN URL pattern is https://m.media-amazon.com/images/I/<image-id>._<resize-spec>.jpg where <resize-spec> controls the rendered size.
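Assuming the CDN URL shape given above (<image-id>._<resize-spec>.jpg), swapping the resize spec is a one-line rewrite (with_resize_spec is a hypothetical helper; the image id in the test is illustrative):

```python
import re

# Replace the ._<resize-spec>.jpg tail of a media-amazon image URL.
def with_resize_spec(url: str, spec: str) -> str:
    return re.sub(r"\._[^.]+\.jpg$", f"._{spec}.jpg", url)
```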

Can you scrape Amazon products without programming?

Yes, through no-code tools like Apify’s Amazon Product Scraper actor, Octoparse, or ParseHub. Each gives you a UI-driven scraper that produces CSV or JSON output without writing Python. The trade-off is cost (no-code tools price 5 to 10x higher than DIY Python) and inflexibility on edge cases. For long-running production work, programming wins. For one-off ad-hoc scrapes, no-code wins.

How many product listings can you scrape per day?

With one residential IP at the conservative 3 requests per minute rate, you scrape 4,320 listings per day. With a rotating residential pool of 50 IPs at the same per-IP rate, 216,000 per day. Production scrapers typically run hundreds of thousands of listings per day on rotating residential pools. Above 1 million per day, the math favors a managed API on cost.
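The throughput arithmetic above is just per-IP rate times minutes per day times pool size:

```python
# Daily listing capacity at a fixed per-IP request rate.
def daily_capacity(ips: int, per_ip_rpm: float = 3.0) -> int:
    return int(ips * per_ip_rpm * 60 * 24)
```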

What’s the difference between scraping product listings and scraping search results?

A product listing is one detail page (one ASIN, deep field set). A search result is a list of summary cards (many ASINs, shallow field set). For catalog discovery, scrape search first to find the ASINs, then scrape each ASIN’s product listing for the full data. See our scrape Amazon search results guide for the search-side workflow.

What’s the best Amazon product scraper API?

For pay-per-success billing across 20 marketplaces with structured JSON output and 55 product fields, Amazon Scraper API starts at $0.90 per 1,000 successful requests. ScraperAPI, Bright Data, Oxylabs, and Apify all offer Amazon product endpoints at varying price points. The full vendor comparison is in our best Amazon scrapers post.
