How to Scrape Amazon Products in 2026 (Python + API Guide)
Scraping Amazon products is harder than it was in 2022 and easier than it was in 2024, depending on which side of the defense-offense balance you pick. The pages themselves still hold the same fields (title, ASIN, price, rating, review count, Buy Box seller, availability), but Amazon now serves robot check HTML to any script whose TLS fingerprint, IP range, or request cadence does not look like a real Chrome session. The playbook that works today is Python plus curl_cffi plus residential proxies, or a managed scraper API that does all three on its side.
This guide covers the whole pipeline: the legal boundaries, the fields available on a product page, the selectors that pull them, working Python code that runs end-to-end, and where a managed API becomes cheaper than a DIY stack. All code uses the Amazon Scraper API as the managed option, from $0.50 per 1,000 requests on Custom plans ($0.90 per 1,000 on pay-as-you-go) with 1,000 free on signup.
The Answer
To scrape Amazon products, pair curl_cffi with impersonate="chrome" to replay a real Chrome TLS fingerprint, route through residential proxies, and parse the product HTML with BeautifulSoup using the documented selectors (#productTitle, .a-price-whole + .a-price-fraction, #acrPopover, #acrCustomerReviewText). Expect 85% to 98% success rates on product detail pages with this stack. For volumes above a few thousand ASINs per month, a managed API like the Amazon Scraper API starting from $0.90 per 1,000 on pay-as-you-go (down to $0.50 per 1,000 on Custom plans) is cheaper than maintaining your own proxy pool and fingerprint library. The full working Python script is included below.
Is It Legal to Scrape Amazon Products?
Scraping Amazon product data that a logged-out user can see is generally legal in the United States, based on the Ninth Circuit ruling in hiQ Labs v. LinkedIn, which held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). That precedent covers product titles, prices, ratings, review counts, Buy Box sellers, and the aggregate review data visible without login. It does not cover scraping behind an Amazon login, buyer order history, seller dashboards, or the private parts of Seller Central.
Amazon’s own Conditions of Use prohibit automated access, which is a separate contract matter from federal law. The practical consequence is that Amazon can ban IP ranges, revoke Associates accounts, and send cease-and-desist letters, but it cannot criminally prosecute scrapers of public product data in the US. For commercial teams, the defensible posture is to scrape only what a logged-out browser can see, respect rate limits and robots.txt, avoid any flow that requires an Amazon login, and use the data for legitimate purposes (price monitoring, MAP compliance, catalog enrichment, market research). This is not legal advice, and teams operating at scale should consult counsel in their jurisdiction.
What Data Can You Extract From Amazon Product Pages?
You can extract eleven core fields from an Amazon product page without logging in: product title, ASIN, Buy Box price, list price (struck-through original), currency, star rating, review count, star distribution, Buy Box seller, availability status, and the first few featured reviews. A full product detail page also exposes brand, category, best-seller rank, variant options (size, color), shipping estimate, and Prime eligibility, though these sit behind less stable selectors and break more often when Amazon ships HTML updates.
Two fields are newly harder. The full review list moved behind a login wall on November 5, 2024, so only 3 to 8 featured reviews are publicly scrapable per ASIN (covered in detail in How to Scrape Amazon Reviews With Python). The per-user price variation is another moving target because Amazon sometimes shows different prices to different sessions based on cookies, location, and Prime membership, so a “clean” unauthenticated scrape returns the guest price, which is usually but not always the retail price a logged-in user sees.
Three fields are worth singling out because they are what most commercial scrapers actually need:
- Buy Box price - the big yellow-button price, which rotates based on which seller currently holds the Buy Box. This is the price that actually gets charged at checkout.
- Star rating and review count - the trust signals that move conversion rate. Available via #acrPopover and #acrCustomerReviewText without scraping any individual review text.
- Availability status - "In Stock", "Only 3 left in stock", "Temporarily out of stock", or variant-specific messages, all under #availability.
What Are the Main Challenges When Scraping Amazon?
The main challenges when scraping Amazon are four signals that Amazon’s anti-bot stack combines to flag automated traffic: IP range classification, TLS fingerprint, header shape, and request cadence. A single signal rarely trips the check. The combination does. That is why changing only the User-Agent or only the proxy almost never works on Amazon, and why most “quick Amazon scraper” tutorials fail within hours.
How Does Amazon Detect Bots?
Amazon detects bots through AWS WAF Bot Control plus a proprietary risk model that inspects the TLS client hello in the first milliseconds of the connection, the HTTP/2 frame order, the full header set (User-Agent, Accept-Language, Sec-Fetch-*, Upgrade-Insecure-Requests), the request cadence from the IP, and the referer chain. Each signal is a feature in a detection model. Real browsers produce a consistent feature vector. Python’s default requests library produces a fingerprint no real browser has ever sent.
When the model flags a request as automated, Amazon returns HTTP 200 with a “Robot Check” HTML body instead of the product page. The response still looks successful to a naive scraper. The #productTitle element is missing, the .a-price-whole element is missing, and the parse yields nothing. This is the single biggest silent-failure trap in Amazon scraping.
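A minimal guard against this trap, reusing the same marker strings the full script later in this guide relies on, turns the silent failure into a loud one:

# Amazon's robot check returns HTTP 200, so never trust the status code alone.
ROBOT_MARKERS = (
    "Robot Check",
    "Enter the characters you see below",
    "/errors/validateCaptcha",
)

def looks_like_product_page(html: str) -> bool:
    # A real product page carries #productTitle; a robot check carries a marker instead.
    return "productTitle" in html and not any(m in html for m in ROBOT_MARKERS)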
Why Do Simple Python Scripts Fail on Amazon?
Simple Python scripts fail on Amazon because requests.get() sends a TLS handshake that no real Chrome browser ever produces, which is enough to trigger the robot check regardless of how good the rest of the code is. The default requests library and urllib3 negotiate a cipher suite order and ALPN extension set that Amazon fingerprints as “Python client” in the first round trip. Even with a rotating User-Agent, a residential proxy, and a polite rate limit, the TLS layer alone is enough to flip the response to robot check HTML.
The fix is to swap requests for curl_cffi, a Python binding for a patched curl build that replays real browser TLS and HTTP/2 fingerprints. This is the single highest-leverage change in any Amazon scraper. Recent independent benchmarks show that fixing the TLS fingerprint alone raises success rates from roughly 30% with default Python to 60 to 70%, and that combining curl_cffi with residential proxies reaches 85 to 90% on most targets.
How Do You Scrape Amazon Products With Python?
You scrape Amazon products with Python by combining four components: curl_cffi for browser-shaped TLS requests, a residential proxy pool for IP diversity, BeautifulSoup for HTML parsing, and a robot-check detector that catches silent failures. The complete flow is fetch → check for CAPTCHA → parse → return. Everything else (retry logic, concurrency, persistence) is scaffolding around that core.
What Libraries Do You Need?
You need three libraries for Amazon scraping in Python: curl_cffi for TLS-impersonated HTTP requests, beautifulsoup4 with lxml for HTML parsing, and the standard library dataclasses module for structured output. Install all three with a single pip command:
pip install curl_cffi beautifulsoup4 lxml
curl_cffi is the only unusual dependency. It ships a patched curl build via CFFI and supports Chrome, Safari, Firefox, and Edge profiles. Chrome is the default choice for Amazon because it is the most common real-browser fingerprint and the least likely to be flagged as out-of-distribution. For workflows that need JavaScript execution (rare on Amazon product pages, which render server-side), add playwright with the playwright-stealth plugin, but for title, price, rating, review count, Buy Box, and availability scraping, curl_cffi alone is sufficient.
How Do You Bypass the TLS Fingerprint Check?
You bypass the TLS fingerprint check by calling curl_cffi.requests.get(url, impersonate="chrome"), which routes the request through a curl build that replays Chrome’s exact TLS client hello, JA3 hash, and HTTP/2 frame order. This is a one-line change from requests.get() and is worth more than any other single modification to an Amazon scraper.
from curl_cffi import requests

resp = requests.get(
    "https://www.amazon.com/dp/B08N5WRWNW",
    impersonate="chrome",
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Upgrade-Insecure-Requests": "1",
    },
    proxies={"https": "http://user:pass@residential.provider.com:8001"},
    timeout=25,
)
impersonate="chrome" automatically sets the User-Agent to a matching recent Chrome version, so do not override it with a mismatched string. Add Accept-Language explicitly to match the target storefront (en-US for amazon.com, de-DE for amazon.de). Route through a residential proxy. Those three changes together push success rates above 85% on product detail pages.
What Selectors Pull the Product Fields?
The selectors that pull the product fields are well-documented and stable across most of Amazon’s product page variants. Target the following elements with BeautifulSoup:
- #productTitle - the product title string, inside a <span>
- .a-price-whole and .a-price-fraction - split-format Buy Box price (dollars and cents as separate spans)
- .a-offscreen - fallback price in a single hidden span with the full formatted string like $29.99
- #acrPopover - element whose title attribute holds the star rating ("4.6 out of 5 stars")
- #acrCustomerReviewText - the review count string ("12,345 ratings")
- #histogramTable - the 5-row star distribution table
- #sellerProfileTriggerId - the Buy Box seller name, when present
- #merchant-info - fallback seller text ("Ships from and sold by Amazon.com")
- #availability - the availability status string
The script below pulls all of these fields in one pass and returns a structured Product dataclass. Notice the split-price fallback: Amazon’s newer split format (.a-price-whole + .a-price-fraction) is primary, and the older .a-offscreen hidden span is the fallback.
import re
from dataclasses import dataclass, asdict

from curl_cffi import requests
from bs4 import BeautifulSoup

CURRENCY_MAP = {"$": "USD", "£": "GBP", "€": "EUR", "¥": "JPY", "₹": "INR"}

# Strings that only appear on Amazon's robot-check interstitial.
ROBOT_MARKERS = (
    "Robot Check",
    "Enter the characters you see below",
    "/errors/validateCaptcha",
)

@dataclass
class Product:
    asin: str
    title: str
    price: float | None
    currency: str | None
    list_price: float | None
    rating: float | None
    review_count: int | None
    seller: str | None
    in_stock: bool

def is_robot_check(html: str) -> bool:
    return any(m in html for m in ROBOT_MARKERS)

def _parse_price(soup: BeautifulSoup) -> tuple[float | None, str | None]:
    # Primary: Amazon's split format (dollars and cents in separate spans).
    whole = soup.select_one(".a-price-whole")
    frac = soup.select_one(".a-price-fraction")
    if whole and frac:
        whole_txt = re.sub(r"[^\d]", "", whole.get_text())
        frac_txt = re.sub(r"[^\d]", "", frac.get_text())
        if whole_txt:
            return float(f"{whole_txt}.{frac_txt or '00'}"), "USD"
    # Fallback: the hidden span with the full formatted price, e.g. "$29.99".
    off = soup.select_one(".a-offscreen")
    if off:
        txt = off.get_text().strip()
        match = re.search(r"([^\d.,\s])\s?([\d,]+(?:\.\d+)?)", txt)
        if match:
            symbol, number = match.groups()
            return float(number.replace(",", "")), CURRENCY_MAP.get(symbol, "USD")
    return None, None

def _parse_list_price(soup: BeautifulSoup) -> float | None:
    node = soup.select_one(".basisPrice .a-offscreen") or soup.select_one(".a-text-strike")
    if node:
        match = re.search(r"[\d,]+(?:\.\d+)?", node.get_text())
        if match:
            return float(match.group().replace(",", ""))
    return None

def _parse_rating(soup: BeautifulSoup) -> float | None:
    # The rating lives in the title attribute: "4.6 out of 5 stars".
    node = soup.select_one("#acrPopover")
    if node and node.get("title"):
        match = re.search(r"([\d.]+)", node["title"])
        if match:
            return float(match.group(1))
    return None

def _parse_review_count(soup: BeautifulSoup) -> int | None:
    node = soup.select_one("#acrCustomerReviewText")
    if node:
        match = re.search(r"([\d,]+)", node.get_text())
        if match:
            return int(match.group(1).replace(",", ""))
    return None

def _parse_seller(soup: BeautifulSoup) -> str | None:
    node = soup.select_one("#sellerProfileTriggerId") or soup.select_one("#merchant-info")
    return node.get_text(strip=True) if node else None

def _parse_availability(soup: BeautifulSoup) -> bool:
    node = soup.select_one("#availability")
    if not node:
        return True
    text = node.get_text(strip=True).lower()
    return not any(m in text for m in ("unavailable", "out of stock", "currently"))

def scrape_product(asin: str, proxy: str | None = None) -> Product:
    url = f"https://www.amazon.com/dp/{asin}"
    resp = requests.get(
        url,
        impersonate="chrome",
        proxies={"https": proxy} if proxy else None,
        headers={"Accept-Language": "en-US,en;q=0.9"},
        timeout=25,
    )
    html = resp.text
    if is_robot_check(html):
        raise RuntimeError(f"Robot check served for ASIN {asin}")
    soup = BeautifulSoup(html, "lxml")
    title_node = soup.select_one("#productTitle")
    if not title_node:
        raise RuntimeError(f"Product title missing for ASIN {asin}")
    price, currency = _parse_price(soup)
    return Product(
        asin=asin,
        title=title_node.get_text(strip=True),
        price=price,
        currency=currency,
        list_price=_parse_list_price(soup),
        rating=_parse_rating(soup),
        review_count=_parse_review_count(soup),
        seller=_parse_seller(soup),
        in_stock=_parse_availability(soup),
    )

if __name__ == "__main__":
    product = scrape_product("B08N5WRWNW")
    print(asdict(product))
Run the script with python scrape.py. Pass a residential proxy via the proxy argument in production. The function raises RuntimeError on a robot check, which the caller should catch and retry on a fresh IP.
How Do You Handle the Robot Check Page?
You handle the robot check page with a detector plus an exponential-backoff retry on a fresh proxy IP. The detector inspects the response body for the three markers (Robot Check, the CAPTCHA form text, and the /errors/validateCaptcha path), and the retry loop cycles through a proxy pool before giving up. Solving the CAPTCHA image is almost never the right answer because retries cost one extra request while solves cost $1 to $3 per 1,000 through third-party services and add 10 to 30 seconds of latency per attempt.
import time

from scrape import Product, scrape_product  # the script above, saved as scrape.py

def fetch_with_retry(asin: str, proxies: list[str], max_attempts: int = 4) -> Product:
    for attempt in range(max_attempts):
        # Rotate to a fresh proxy on every attempt.
        proxy = proxies[attempt % len(proxies)]
        try:
            return scrape_product(asin, proxy=proxy)
        except RuntimeError as e:
            if "Robot check" not in str(e):
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s between attempts.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Failed to scrape {asin} after {max_attempts} attempts")
For a complete treatment of CAPTCHA avoidance, see How to Bypass Amazon Captcha.
How Do You Scrape Amazon Products Without Coding?
You scrape Amazon products without coding by using a no-code scraper tool that provides a visual point-and-click interface plus pre-built Amazon templates, or by pasting ASINs into a Google Sheet that is wired to a scraper API via Apps Script. The no-code path trades flexibility for speed: you cannot easily customize the fields, but you can be running within minutes. Tools in this category include Parsehub, Octoparse, Apify actors, and Google Sheets add-ons like ImportFromWeb.
The Sheets plus API approach is a middle path that requires a few lines of Apps Script but runs inside a spreadsheet most teams already know. The full recipe is in How to Scrape Amazon Data into Google Sheets. The short version: paste ASINs or product URLs in column A, wire up an Apps Script function that calls the Amazon Scraper API per row, and schedule it with a time-based trigger. A 500-ASIN watchlist refreshed daily costs about $13.50 per month on pay-as-you-go billing (or less on Custom).
How Do You Scrape Amazon Products at Scale With an API?
You scrape Amazon products at scale with an API by POSTing ASINs to a managed endpoint that handles proxies, TLS fingerprinting, CAPTCHA detection, and retries on its side, and returns clean HTML or structured JSON. The API is the practical answer for any workload above roughly 50,000 to 100,000 requests per month, because the maintenance cost of a DIY proxy pool and fingerprint library exceeds the $0.50 per 1,000 request price at that volume.
The Amazon Scraper API is purpose-built for this workload. Pricing starts at $0.90 per 1,000 requests on pay-as-you-go and drops to $0.50 per 1,000 on Custom plans; 1,000 free on signup. Concurrency scales from 10 on Free to 50 on Pro and 100-500+ on Custom, enough for 1,500 to 2,500 ASINs per minute on a single account. Coverage includes 19 Amazon marketplaces (US, UK, DE, FR, IT, ES, NL, PL, SE, CA, MX, BR, AU, JP, SG, IN, TR, AE, SA). Median response time is 2.6 seconds per ASIN under steady load. Failed requests (CAPTCHA pages, timeouts) are not billed.
The sync endpoint accepts a single ASIN and returns structured JSON:
import requests

def fetch_via_api(asin: str, api_key: str, marketplace: str = "US") -> dict:
    resp = requests.get(
        "https://api.amazonscraperapi.com/v1/product",
        params={"asin": asin, "marketplace": marketplace, "api_key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
For batch workloads, the async endpoint accepts up to 1,000 ASINs per POST and delivers the results via webhook, which removes retry orchestration from client code. See /pricing for the full plan matrix.
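A sketch of the submit side, assuming a hypothetical /v1/product/batch path and payload shape (the real field names and path are in the API documentation):

import requests

def submit_batch(asins: list[str], api_key: str, webhook_url: str) -> str:
    # Hypothetical endpoint and body - illustrative only; check the docs
    # for the actual contract.
    resp = requests.post(
        "https://api.amazonscraperapi.com/v1/product/batch",
        params={"api_key": api_key},
        json={"asins": asins[:1000], "marketplace": "US", "webhook": webhook_url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]  # results are delivered to webhook_url as they complete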
How Do You Store Scraped Amazon Data?
You store scraped Amazon data in a database that supports time-series queries if you need price history, or a flat CSV or Google Sheet if you only need the latest snapshot. For price monitoring and MAP compliance, time-series is mandatory because the value of the data is in the delta between today and yesterday. For one-off catalog enrichment, a single-snapshot CSV is enough.
Three storage patterns cover most use cases:
- Flat CSV or Google Sheet - simplest for small catalogs under 5,000 ASINs with daily refresh. No infrastructure. Easy to share.
- Postgres or SQLite with a prices table keyed on (asin, timestamp) - the right default for in-house price monitoring. Cheap to query and covers a few million price snapshots without any scaling work.
- Time-series database (TimescaleDB, InfluxDB, ClickHouse) - for catalogs above 100,000 ASINs with hourly refresh, where the row count crosses 10 million per month. Not needed below that scale.
Whatever the storage, always capture three columns beyond the product fields: scraped_at (UTC timestamp), marketplace, and raw_html or raw_json for a short retention window (24 to 72 hours). The raw HTML lets you re-parse historical snapshots when Amazon ships a selector change and your parser needs a backfill.
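A minimal sketch of the Postgres/SQLite pattern using the standard-library sqlite3 module; the table and column names are illustrative, but the (asin, timestamp) key and the three audit columns match the advice above:

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("prices.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS prices (
        asin        TEXT NOT NULL,
        scraped_at  TEXT NOT NULL,   -- UTC ISO-8601 timestamp
        marketplace TEXT NOT NULL,
        price       REAL,
        currency    TEXT,
        in_stock    INTEGER,
        raw_html    TEXT,            -- keep 24 to 72 hours for re-parsing, then purge
        PRIMARY KEY (asin, scraped_at)
    )
""")

def save_snapshot(product, raw_html: str, marketplace: str = "US") -> None:
    # product is the Product dataclass returned by scrape_product above
    conn.execute(
        "INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)",
        (product.asin, datetime.now(timezone.utc).isoformat(), marketplace,
         product.price, product.currency, int(product.in_stock), raw_html),
    )
    conn.commit()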
How Often Should You Re-Scrape Amazon Prices?
You should re-scrape Amazon prices at the cadence that matches the business decision the data drives: hourly for repricing bots and Buy Box tracking, every 6 hours for competitive monitoring, daily for MAP compliance and catalog sync, weekly for market research. Price-monitoring industry data shows that roughly 37% of professional monitors poll at least hourly and 12% poll at 5-minute intervals, confirming that sub-hourly refresh is the real-time operator default rather than a niche pattern.
Higher frequencies are expensive. A 10,000-ASIN catalog refreshed hourly is 240,000 requests per day, or 7.2 million per month. At $0.50 per 1,000 on the Amazon Scraper API, that is $3,600 per month. The same catalog refreshed daily is $150 per month. Match the cadence to the decision: if a price drop detected 23 hours late still lets you react in time, daily is fine. If a Buy Box change needs to trigger an automated reprice within 15 minutes, you need hourly or better.
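The arithmetic is worth wrapping in a helper so cadence decisions stay grounded in dollars (prices as quoted above; adjust the rate for your plan):

def monthly_cost(catalog_size: int, refreshes_per_day: int, usd_per_1k: float = 0.50) -> float:
    # requests per month = catalog size x refreshes per day x 30 days
    return catalog_size * refreshes_per_day * 30 / 1000 * usd_per_1k

print(monthly_cost(10_000, 24))  # hourly refresh: 3600.0
print(monthly_cost(10_000, 1))   # daily refresh: 150.0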
FAQ
Does Amazon allow web scraping?
Amazon’s Conditions of Use prohibit automated access, but scraping public product data is not illegal in the US per the hiQ v. LinkedIn ruling. Amazon can ban your IPs and close your Associates account if they detect scraping, but they cannot criminally prosecute you for scraping public listings.
What is the best language for scraping Amazon?
Python is the most popular choice because of curl_cffi, BeautifulSoup, and the scientific Python ecosystem around data analysis. Node.js with Playwright-stealth is the second-most-common stack, especially for teams already in JavaScript. The language matters less than the TLS-fingerprint and proxy stack.
How many Amazon products can you scrape per day?
With a residential proxy pool at 50 concurrent threads, expect 50,000 to 200,000 ASINs per day from a DIY stack. A managed API like the Amazon Scraper API supports 2,000+ ASINs per minute on a single account by default, so a 24-hour run can comfortably exceed 1 million ASINs.
How do you scrape Amazon search results?
Fetch https://www.amazon.com/s?k=KEYWORD through the same curl_cffi plus residential proxy stack, then parse the div[data-component-type="s-search-result"] cards. Each card contains the ASIN, title, price, rating, and product URL. Search pages have slightly tighter rate limits than product detail pages and are more sensitive to request cadence.
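A hedged sketch of that parse; the data-asin attribute on each result card is the most stable hook, while the inner title and price selectors follow the product-page conventions and shift more often:

from curl_cffi import requests
from bs4 import BeautifulSoup

def scrape_search(keyword: str, proxy: str | None = None) -> list[dict]:
    resp = requests.get(
        "https://www.amazon.com/s",
        params={"k": keyword},
        impersonate="chrome",
        proxies={"https": proxy} if proxy else None,
        headers={"Accept-Language": "en-US,en;q=0.9"},
        timeout=25,
    )
    soup = BeautifulSoup(resp.text, "lxml")
    results = []
    for card in soup.select('div[data-component-type="s-search-result"]'):
        title = card.select_one("h2")
        price = card.select_one(".a-price .a-offscreen")  # absent on some cards
        results.append({
            "asin": card.get("data-asin"),
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return results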
How do you extract the ASIN from a URL?
Match /dp/ or /gp/product/ followed by a 10-character alphanumeric ID. A single regex handles all Amazon URL formats: re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url).
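Wrapped as a helper for pipeline code:

import re

def extract_asin(url: str) -> str | None:
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_asin("https://www.amazon.com/dp/B08N5WRWNW"))  # B08N5WRWNW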
Can you scrape Amazon reviews?
Only the 3 to 8 featured reviews and the aggregate rating data after November 5, 2024. The full review list is behind a login wall. See How to Scrape Amazon Reviews With Python for the current selectors and what is still publicly accessible.
How much does it cost to scrape 1 million Amazon products?
At $0.50 per 1,000 successful requests on the Amazon Scraper API, 1 million ASINs costs $500. A DIY stack with a residential proxy pool at $6 per GB and 500 KB per page averages $3 per 1,000 ASINs plus engineering maintenance, so 1 million ASINs costs $3,000 or more in infrastructure alone.
Sources
- hiQ Labs v. LinkedIn - CFAA precedent for public-data scraping - https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
- curl_cffi library - TLS impersonation for Python - https://github.com/lexiforest/curl_cffi
- AWS WAF CAPTCHA and Challenge documentation - https://docs.aws.amazon.com/waf/latest/developerguide/waf-captcha-and-challenge.html
- Proxies.sx: Best Proxies for Amazon Scraping 2026 - https://www.proxies.sx/blog/best-proxies-amazon-scraping-2026