How to Scrape Amazon Reviews With Python in 2026

The Answer

Scraping Amazon reviews is a different problem than it was in 2023. On November 5, 2024 Amazon moved nearly all reviews behind a login wall, and the /product-reviews/ endpoint now redirects to a sign-in page for unauthenticated traffic. What stays publicly scrapable is the 3 to 8 featured reviews on the main product detail page plus the aggregate rating, total review count, and the 5-star to 1-star distribution. This article shows how to pull that data with Python requests and BeautifulSoup, what the working selectors are, and when to switch to the Amazon Scraper API for bulk jobs where a DIY scraper breaks down.

What Changed With Amazon Reviews in 2024?

On November 5, 2024, Amazon locked most product reviews behind a login wall, and any unauthenticated request to amazon.com/product-reviews/<ASIN> now redirects to a sign-in page. This is the single biggest fact that shapes any review-scraping project. Before the lockdown, a scraper could paginate through every review on any product. After it, unauthenticated scrapers see only the handful of reviews Amazon elects to show on the main detail page.

Medium writer Lennart Biesel documented the change within hours, and every scraping service since has confirmed the same behavior: no amount of proxies, headers, or technical tricks bypasses the login redirect on the product-reviews pages.

Everything below assumes you are scraping what is still publicly available, which is a smaller but still useful dataset. If your use case requires every review ever posted, you will need a different approach (enterprise data partnerships, first-party integrations, or Amazon’s own seller APIs for products you own).

What Review Data Is Still Publicly Scrapable?

The publicly scrapable review data is the featured review block on the product detail page, the aggregate star rating, the total review count, and the percent distribution across 5-star through 1-star buckets. Amazon surfaces between three and eight featured reviews on the product page depending on category and device class, and those reviews render in the raw HTML without any authentication.

Specifically you can still extract:

  • Aggregate rating - the average star rating, rendered in #acrPopover
  • Total review count - rendered in #acrCustomerReviewText
  • Star distribution - the 5/4/3/2/1-star percentages, rendered in #histogramTable
  • Featured reviews (3 to 8 per page) - each with title, author, rating, date, body text, and verified-purchase badge
  • “Top review” highlights - short excerpts Amazon chooses to feature

What you can no longer extract without authenticated access:

  • The full review list beyond the featured block
  • Per-star-filtered review pages
  • Review helpful-votes and comment counts for non-featured reviews
  • Reviewer profile pages

For most product-intelligence use cases (tracking sentiment drift, catching sudden rating drops, sampling recent customer complaints), the featured block is enough. For exhaustive review analysis, it is not.

What Do You Need to Scrape Amazon Reviews With Python?

You need Python 3.9 or newer, the requests and beautifulsoup4 packages, and a residential or mobile proxy if you plan to make more than a handful of calls. Amazon serves robot-check pages to datacenter IPs at roughly a 30 percent rate, and to unmaintained residential IPs at 5 to 15 percent. A clean scrape-to-parse pipeline needs a way to detect and handle those robot pages.

pip install requests beautifulsoup4 lxml

A realistic browser User-Agent string is also essential. Amazon fingerprints the default python-requests/2.x User-Agent on the very first request and blocks it. The User-Agent below matches a current Safari on macOS and has held up in production scrapers for months.

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/17.6 Safari/605.1.15"
)

If you cannot or do not want to maintain proxy infrastructure, the Amazon Scraper API is the alternative that handles rotation, TLS fingerprinting, and CAPTCHA retry server-side. The rest of this guide walks through the DIY path first, then the managed path.

How Do You Fetch the Product Page in Python?

You fetch the product page with a single requests.get call to https://www.amazon.com/dp/<ASIN>, sending a browser User-Agent and checking the response for Amazon’s robot-check markers. The review block you need lives inside the same HTML as the product page, so there is no separate reviews fetch.

import requests

HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

# Strings that only appear on Amazon's CAPTCHA / robot-check pages.
ROBOT_MARKERS = (
    "captchacharacters",
    "Enter the characters you see below",
    "To discuss automated access",
    "Robot Check",
)

class AmazonBlocked(RuntimeError):
    pass

def fetch_product_page(asin: str, proxy_url: str | None = None) -> str:
    url = f"https://www.amazon.com/dp/{asin}"
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=30)
    resp.raise_for_status()
    if any(marker in resp.text for marker in ROBOT_MARKERS):
        raise AmazonBlocked(f"Robot check on {url}")
    return resp.text

The ROBOT_MARKERS check is not optional. Amazon returns HTTP 200 for its CAPTCHA page, so a naive scraper that only checks the status code will silently hand empty parse results to downstream code instead of failing loudly.
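When a robot check does fire, the useful recovery is usually a retry through a different proxy rather than an immediate hard failure. Here is a minimal sketch of a retry wrapper with jittered exponential backoff; the backoff constants and the fetch-as-callable pattern are my assumptions, not part of the scraper above:

```python
import random
import time

def fetch_with_retry(fetch, attempts: int = 3, base_delay: float = 2.0):
    """Call fetch() up to `attempts` times, sleeping with jittered
    exponential backoff between tries. `fetch` is any zero-argument
    callable that raises on failure, e.g. a lambda wrapping
    fetch_product_page with a fresh proxy per attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as e:  # in the scraper above this would be AmazonBlocked
            last_error = e
            if attempt < attempts - 1:
                # Jitter breaks up the fixed rhythm a plain backoff would have.
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise last_error
```

Passing the fetch as a zero-argument callable keeps the retry logic independent of how proxies are chosen, so each attempt can pick a fresh proxy.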

How Do You Parse the Featured Reviews?

You parse featured reviews by selecting elements marked with Amazon’s data-hook attributes, which are the most stable selectors on the page. Class names like .a-row and .a-size-base drift every few weeks, but data-hook values (review, review-title, review-body, review-date, review-star-rating) have been stable for years because Amazon uses them internally for analytics.

from dataclasses import dataclass, field
from bs4 import BeautifulSoup
import re

@dataclass
class Review:
    author: str | None
    rating: float | None
    title: str | None
    date: str | None
    body: str | None
    verified_purchase: bool

@dataclass
class ReviewSummary:
    asin: str
    average_rating: float | None
    total_count: int | None
    star_distribution: dict[str, float] = field(default_factory=dict)
    featured_reviews: list[Review] = field(default_factory=list)

def parse_reviews(html: str, asin: str) -> ReviewSummary:
    soup = BeautifulSoup(html, "lxml")
    return ReviewSummary(
        asin=asin,
        average_rating=_parse_average_rating(soup),
        total_count=_parse_total_count(soup),
        star_distribution=_parse_star_distribution(soup),
        featured_reviews=_parse_featured_reviews(soup),
    )

def _parse_average_rating(soup: BeautifulSoup) -> float | None:
    el = soup.select_one("#acrPopover")
    text = el.get("title") if el else None
    if not text:
        # Fallback: the alt-text span nested inside the star icon
        # (descendant selector -- the two classes sit on different elements).
        alt = soup.select_one(".a-icon-star .a-icon-alt")
        text = alt.get_text() if alt else None
    if not text:
        return None
    m = re.search(r"([0-9](?:\.[0-9])?)", text)
    return float(m.group(1)) if m else None

def _parse_total_count(soup: BeautifulSoup) -> int | None:
    el = soup.select_one("#acrCustomerReviewText")
    if not el:
        return None
    digits = re.sub(r"[^0-9]", "", el.get_text())
    return int(digits) if digits else None

def _parse_star_distribution(soup: BeautifulSoup) -> dict[str, float]:
    out: dict[str, float] = {}
    table = soup.select_one("#histogramTable")
    if not table:
        return out
    for row in table.select("tr"):
        label = row.select_one("td.aok-nowrap a")
        pct = row.select_one("td.a-text-right a, td._cr-ratings-histogram_style_histogram-column-space__RKUAd")
        if label and pct:
            key = label.get_text(strip=True).replace(" star", "")
            m = re.search(r"([0-9]+)%", pct.get_text())
            if m:
                out[key] = float(m.group(1)) / 100.0
    return out

def _parse_featured_reviews(soup: BeautifulSoup) -> list[Review]:
    out: list[Review] = []
    for card in soup.select('[data-hook="review"]'):
        title_el = (card.select_one('[data-hook="review-title"] span:not([class])')
                    or card.select_one('[data-hook="review-title"]'))
        body_el = card.select_one('[data-hook="review-body"] span')
        date_el = card.select_one('[data-hook="review-date"]')
        rating_el = card.select_one('[data-hook="review-star-rating"], [data-hook="cmps-review-star-rating"]')
        author_el = card.select_one(".a-profile-name")
        verified_el = card.select_one('[data-hook="avp-badge"]')

        rating: float | None = None
        if rating_el:
            m = re.search(r"([0-9](?:\.[0-9])?)", rating_el.get_text())
            if m:
                rating = float(m.group(1))

        out.append(Review(
            author=author_el.get_text(strip=True) if author_el else None,
            rating=rating,
            title=title_el.get_text(strip=True) if title_el else None,
            date=date_el.get_text(strip=True) if date_el else None,
            body=body_el.get_text(" ", strip=True) if body_el else None,
            verified_purchase=bool(verified_el),
        ))
    return out

The data-hook selectors come directly from Amazon’s own review widgets, which is why they are more resilient than class-based selectors. When Amazon refactors the CSS (which happens every few weeks), data-hook values almost always survive the refactor because they are referenced by Amazon’s own JavaScript and A/B testing pipelines.

How Do You Put the Scraper Together End to End?

You put the scraper together by chaining fetch_product_page into parse_reviews and handling the two failure modes that matter in production: robot blocks and missing data.

import json
import os
import sys
from dataclasses import asdict

def scrape_reviews(asin: str, proxy_url: str | None = None) -> ReviewSummary:
    html = fetch_product_page(asin, proxy_url=proxy_url)
    return parse_reviews(html, asin=asin)

def main() -> int:
    if len(sys.argv) < 2:
        print("Usage: python amazon_reviews.py <ASIN>", file=sys.stderr)
        return 1
    asin = sys.argv[1]
    proxy_url = os.environ.get("PROXY_URL")
    try:
        summary = scrape_reviews(asin, proxy_url=proxy_url)
    except AmazonBlocked as e:
        print(f"blocked: {e}", file=sys.stderr)
        return 2
    print(json.dumps(asdict(summary), indent=2, ensure_ascii=False))
    return 0

if __name__ == "__main__":
    sys.exit(main())

Run it against a product that has reviews and you get back JSON with the aggregate rating, total count, star distribution, and the three to eight featured reviews. Run it against an ASIN with zero reviews and you get the same structure with average_rating: null, total_count: null, and an empty featured_reviews list. The scraper does not crash on missing fields, which matters because review-less listings are common.
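Before trusting a scraped summary downstream, a cheap consistency check pays off: the histogram fractions should roughly sum to 1, and a weighted average of the star buckets should land near the reported rating. A sketch, where the tolerance values are arbitrary assumptions of mine (Amazon rounds each bucket to a whole percent, so exact agreement is not expected):

```python
def summary_looks_sane(average_rating, star_distribution, tolerance=0.35):
    """Cross-check the aggregate rating against the histogram buckets.
    star_distribution maps "5".."1" to fractions, as produced by
    _parse_star_distribution above. Returns True when either field is
    missing (nothing to contradict) or when the two roughly agree."""
    if average_rating is None or not star_distribution:
        return True
    total = sum(star_distribution.values())
    if not 0.9 <= total <= 1.1:  # rounding should not move the sum far from 1
        return False
    weighted = sum(int(star) * frac for star, frac in star_distribution.items()) / total
    return abs(weighted - average_rating) <= tolerance
```

A summary that fails this check is a strong hint that one of the two extraction paths picked up stale or mismatched HTML.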

Why Does the Scraper Break at Volume?

The scraper breaks at volume because Amazon rate-limits per IP, rotates its CSS classes every few weeks, and serves robot-check pages stochastically even to residential proxies. A single developer machine can reliably pull 50 to 100 product pages per hour before Amazon starts sending CAPTCHAs on the majority of requests. A rotating residential proxy pool extends that to a few thousand per hour before the same thing happens.
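Scaling past a single IP starts with nothing fancier than round-robin rotation over a pool of proxy URLs. A minimal sketch follows; the example.com endpoints are placeholders for whatever your provider hands you:

```python
from itertools import cycle

class ProxyPool:
    """Round-robin over a fixed list of proxy URLs. Production pools
    also track failures and evict dead proxies; this is the minimal core."""
    def __init__(self, proxy_urls):
        self._cycle = cycle(proxy_urls)

    def next_url(self):
        # Each call hands out the next proxy in the rotation.
        return next(self._cycle)

pool = ProxyPool([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
])
```

Each page fetch could then pass pool.next_url() as the proxy_url argument instead of a single fixed proxy, so consecutive requests leave from different IPs.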

Three specific failure modes appear in any DIY scraper running at more than toy scale:

  • Robot-check pages return HTTP 200 with CAPTCHA HTML. Your parser returns empty data unless you explicitly detect the marker strings.
  • TLS fingerprinting catches Python’s default requests library because requests uses OpenSSL defaults that look different from a real Chrome or Safari handshake. The curl_cffi library impersonates browser TLS stacks and is the go-to fix, but it is another dependency to maintain.
  • Selector churn on the reviews block drops your extraction rate silently. A monthly test run against a known-good ASIN with known-good review count is the only reliable detection.

For a homework project or a scraper that runs a dozen ASINs a day, none of this matters. For a product-intelligence pipeline running a thousand ASINs daily for a brand-analytics dashboard, all three of these failure modes will hit you within a week.
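The selector-churn failure mode in particular is detectable with a tiny canary: scrape one ASIN whose metrics you already know and check which fields came back. A sketch, where the thresholds and the alert wording are illustrative assumptions:

```python
def selector_health(summary_dict, expect_min_reviews=3):
    """Return a list of alert strings for a parsed summary of a
    known-good ASIN. An empty list means the selectors still work."""
    alerts = []
    if summary_dict.get("average_rating") is None:
        alerts.append("average_rating missing: #acrPopover may have drifted")
    if summary_dict.get("total_count") is None:
        alerts.append("total_count missing: #acrCustomerReviewText may have drifted")
    if not summary_dict.get("star_distribution"):
        alerts.append("star_distribution empty: #histogramTable selectors may have drifted")
    if len(summary_dict.get("featured_reviews", [])) < expect_min_reviews:
        alerts.append("featured review count low: data-hook selectors may have drifted")
    return alerts
```

Run this daily or weekly against the known-good ASIN and page someone when the list is non-empty; silent extraction decay is otherwise invisible until a dashboard goes blank.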

When Should You Use a Managed Amazon Reviews API?

You should use a managed Amazon reviews API when your volume exceeds what a self-maintained scraper can reliably deliver, which in practice is anything above a few hundred ASINs per day with repeatability guarantees. The break-even point is usually labor cost rather than proxy cost. Maintaining residential proxies, CAPTCHA detection, selector updates, and TLS fingerprint rotation is roughly 10 to 20 percent of a back-end engineer’s week over any sustained period.

The Amazon Scraper API returns the featured review block, aggregate rating, star distribution, and total count as structured JSON on a single call to its product endpoint. The response already includes the reviews Amazon still exposes publicly, so you do not pay extra per review. Pricing starts at $0.90 per 1,000 requests on pay-as-you-go and drops to $0.50 per 1,000 on Custom plans, with 1,000 free on signup and no credit card. For higher volume, the async batch endpoint accepts up to 1,000 ASINs per POST and delivers JSON via webhook once processing finishes. Median product-endpoint latency on the provider’s own benchmarks is around 2.6 seconds.

The decision tree is straightforward:

  • Under 100 ASINs per day, one-time project - Use the Python scraper above.
  • 100 to 500 ASINs per day, short-term project - Python scraper with a residential proxy pool. Budget time for selector maintenance.
  • 500+ ASINs per day, or a product you are shipping to customers - Use the managed API. The DIY cost of reliability is higher than the $0.50 per 1,000.
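The labor-versus-request-cost claim is easy to sanity-check with the article's own numbers. A sketch, assuming a $150,000 fully loaded annual engineer cost (my assumption) and the low end of the 10 to 20 percent maintenance figure quoted above:

```python
# Managed-API cost at the Custom-plan rate quoted above.
asins_per_day = 1_000
rate_per_1k = 0.50  # dollars per 1,000 requests
api_monthly = asins_per_day * 30 / 1_000 * rate_per_1k  # $15/month

# DIY maintenance cost: a slice of one engineer's time.
engineer_annual = 150_000   # assumption: fully loaded annual cost
maintenance_share = 0.10    # low end of the 10-20% range above
diy_monthly = engineer_annual / 12 * maintenance_share  # $1,250/month
```

Even at the low end, the labor cost dominates the request cost by roughly two orders of magnitude, and proxy fees only widen the gap.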

Is It Legal to Scrape Amazon Reviews?

Scraping publicly available Amazon review data is generally legal in the United States under the 2022 Ninth Circuit ruling in hiQ Labs v. LinkedIn, which held that scraping publicly accessible data is not a violation of the Computer Fraud and Abuse Act. That ruling applies only to data that is actually public, which in Amazon’s case means the featured review block and aggregate metrics visible without logging in.

Scraping data behind the review login wall is a different problem. Authenticated scraping puts you under Amazon’s Terms of Service, which explicitly prohibit automated data extraction by any logged-in account. That is a contract breach, not a criminal issue, but Amazon has suspended seller accounts and revoked API access over it.

The safe posture is to scrape only what Amazon exposes publicly, respect robots.txt where it applies, and never send authenticated traffic through a scraper. If your use case requires every review on a product, you need to either work with Amazon directly through Vendor Central or partner with a compliant data vendor.

FAQ

Can I scrape Amazon reviews without getting blocked?

You can scrape a small number of Amazon reviews without getting blocked if you use a residential proxy, a realistic User-Agent, and pause between requests. Beyond a few hundred requests per hour from a single IP, robot-check pages are effectively unavoidable without a rotating proxy pool.
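In practice, "pause between requests" means randomized delays rather than a fixed interval, since a metronomic request rhythm is itself easy to fingerprint. A sketch of jittered pacing; the 5-to-15-second window is an assumption chosen to keep a single IP well under a few hundred requests per hour:

```python
import random
import time

def paced(asins, min_delay=5.0, max_delay=15.0):
    """Yield ASINs with a random sleep between items, so the request
    timing never settles into a detectable fixed rhythm."""
    for i, asin in enumerate(asins):
        if i:  # no delay before the first request
            time.sleep(random.uniform(min_delay, max_delay))
        yield asin

# Usage with the scraper above:
# for asin in paced(["B0ABC12345", "B0DEF67890"]):
#     scrape_reviews(asin)
```

At the default window this averages one request every ten seconds, or about 360 per hour from one IP, which is already at the edge of what the answer above calls sustainable.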

Does the login wall apply to all Amazon marketplaces?

The login wall applies to amazon.com, amazon.co.uk, amazon.de, and every other marketplace tested since November 2024. The featured review block on the main detail page remains visible across all of them.

Can I scrape reviews for my own products with the SP-API?

Amazon’s SP-API does not expose customer review content, even for your own products. Sellers can see aggregate review metrics in Brand Analytics, but the individual review text is not available through any official API.

How often do the selectors change?

Amazon refactors review-block CSS roughly every 4 to 8 weeks. The data-hook attributes (review, review-title, review-body, review-date) are more durable than class names and have held since well before the 2024 lockdown.

What is the difference between scraping reviews and scraping product data?

Scraping reviews pulls the featured-review block and aggregate metrics from the product page. Scraping product data pulls title, price, rating, availability, images, and bullet points. Both sets of data live on the same /dp/<ASIN> page, so one fetch returns both.