Puppeteer
This page explains why Puppeteer is the page scraping framework of choice for modern web data collection in the Node.js ecosystem, and describes the platform’s officially recommended usage patterns.
1. Positioning and Role of Puppeteer
Puppeteer is a Node.js-based browser automation framework that controls Chromium browsers directly via the Chrome DevTools Protocol (CDP).
Its core capabilities include:
- Real Chromium browser control
- JavaScript execution and full page rendering
- DOM manipulation and event simulation
- Native WebSocket (CDP) support for connecting to remote browsers
- Well suited for page scraping and automation tasks in the Node.js ecosystem
Puppeteer does not simulate browser requests.
It directly drives a real browser to execute page logic.
2. Officially Recommended Implementation
1️⃣ Connecting to a Remote Fingerprint Browser
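The endpoint used in this step has the form `ws://<auth>@<host>`, so a missing `PROXY_AUTH` would be interpolated as the literal string `null`, and logging the endpoint also writes the credentials to the log. As a hedged sketch, the two helpers below (`buildBrowserEndpoint` and `maskEndpoint`, both hypothetical and not part of the platform SDK or Puppeteer) show one way to handle both cases. Whether the platform accepts an endpoint without credentials is an assumption to verify.

```javascript
// Hypothetical helpers; not part of the platform SDK or Puppeteer.

// Build the WebSocket endpoint from optional credentials and a host.
function buildBrowserEndpoint(auth, host) {
  if (!host) {
    throw new Error('A browser WebSocket host is required')
  }
  // Omit the "user:pass@" part entirely when no credentials are configured,
  // instead of interpolating the string "null" into the URL.
  return auth ? `ws://${auth}@${host}` : `ws://${host}`
}

// Replace the credential part of an endpoint so it can be logged safely.
function maskEndpoint(endpoint) {
  return endpoint.replace(/\/\/[^@/]+@/, '//***@')
}

const endpoint = buildBrowserEndpoint(
  process.env.PROXY_AUTH || null,
  process.env.ChromeWs || 'chrome-ws-inner.coreclaw.com'
)
// Log only the masked form; pass the unmasked `endpoint` to puppeteer.connect().
console.log(maskEndpoint(endpoint))
```

The platform's recommended connection code for this step is: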
let auth = null
try {
  auth = process.env.PROXY_AUTH || null
  await coresdk.log.info(`Browser authentication info: ${auth}`)
} catch (err) {
  await coresdk.log.error(
    `Failed to obtain browser authentication info: ${err.message}`
  )
  auth = null
}

// Fingerprint browser endpoint (read from environment variable for flexible deployment)
const chromeWs = process.env.ChromeWs || 'chrome-ws-inner.coreclaw.com'
await coresdk.log.info(`Chrome WebSocket endpoint: ${chromeWs}`)

let browser_url = `ws://${auth}@${chromeWs}`
await coresdk.log.info(`Fingerprint browser endpoint: ${browser_url}`)

2️⃣ Page Navigation and Content Retrieval
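Navigation in this step starts from the `url` field of the input JSON, which may be missing or malformed. As a minimal sketch, the helper below (`normalizeTargetUrl`, a hypothetical name that is not part of the platform SDK) validates the value with Node's built-in `URL` class before it is handed to `page.goto`. Restricting the protocol to `http:` and `https:` is an assumption about what the scraper should accept.

```javascript
// Hypothetical helper; not part of the platform SDK or Puppeteer.
function normalizeTargetUrl(input) {
  const raw = input?.url
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error('Input parameter "url" is missing or empty')
  }

  let parsed
  try {
    // The WHATWG URL constructor throws a TypeError on unparseable input.
    parsed = new URL(raw.trim())
  } catch {
    throw new Error(`Input parameter "url" is not a valid URL: ${raw}`)
  }

  // Assumption: only web pages are valid scraping targets.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported URL protocol: ${parsed.protocol}`)
  }
  return parsed.href
}

// Example with a literal input object standing in for the platform input.
console.log(normalizeTargetUrl({ url: ' https://example.com/a?b=1 ' }))
```

Failing early keeps an invalid input from surfacing later as a less specific navigation error. The platform's recommended navigation code is: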
// `url` is declared with `let` at the top of run() in the complete script
url = inputJson?.url
await coresdk.log.info(`Processing URL: ${url}`)

let browser = await puppeteer.connect({
  browserWSEndpoint: browser_url,
  defaultViewport: null, // Disable Puppeteer's default viewport
})

const page = await browser.newPage()
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 })
const html = await page.content()

let result = {
  url: url,
  html: html,
  resp_status: '200',
}

3. Complete Platform Script Entry Example (Recommended)
#!/usr/bin/env node
'use strict'

const coresdk = require('./sdk')
const puppeteer = require('puppeteer-core')

async function run() {
  let url = ''
  let browser = null
  try {
    // 1. Define table headers
    const headers = [
      { label: 'url', key: 'url', format: 'text' },
      { label: 'html', key: 'html', format: 'text' },
      { label: 'resp_status', key: 'resp_status', format: 'text' },
    ]

    await coresdk.result.setTableHeader(headers)

    // 2. Retrieve input parameters
    const inputJson = await coresdk.parameter.getInputJSONObject()
    await coresdk.log.debug(
      `Input parameters: ${JSON.stringify(inputJson)}`
    )

    // 3. Obtain fingerprint browser authentication
    let auth = null
    try {
      auth = process.env.PROXY_AUTH || null
      await coresdk.log.info(`Browser authentication info: ${auth}`)
    } catch (err) {
      await coresdk.log.error(
        `Failed to obtain browser authentication info: ${err.message}`
      )
      auth = null
    }

    // Fingerprint browser endpoint (read from environment variable for flexible deployment)
    const chromeWs = process.env.ChromeWs || 'chrome-ws-inner.coreclaw.com'
    await coresdk.log.info(`Chrome WebSocket endpoint: ${chromeWs}`)

    let browser_url = `ws://${auth}@${chromeWs}`
    await coresdk.log.info(`Fingerprint browser endpoint: ${browser_url}`)

    // 4. Business logic
    url = inputJson?.url
    await coresdk.log.info(`Processing URL: ${url}`)

    browser = await puppeteer.connect({
      browserWSEndpoint: browser_url,
      defaultViewport: null,
    })

    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 })
    const html = await page.content()

    let result = {
      url: url,
      html: html,
      resp_status: '200',
    }

    // 5. Push results to the platform
    await coresdk.result.pushData(result)

    await coresdk.log.info('Script execution completed')
  } catch (err) {
    await coresdk.log.error(`Script execution error: ${err.message}`)

    const errorResult = {
      url: url,
      html: err.message,
      resp_status: '500',
    }

    await coresdk.result.pushData(errorResult)
    throw err
  } finally {
    // Release the WebSocket connection so the Node.js process can exit.
    // disconnect() detaches without closing the remote browser; use
    // browser.close() instead if the remote session should be ended.
    if (browser) {
      await browser.disconnect()
    }
  }
}

run()

4. Dynamic Content Handling and DOM Operations
Retrieving a Single Element
// Method 1: CSS selectors (recommended)
const element = await page.$('.product-title')
const mainContent = await page.$('#main-content')
const heading = await page.$('h1')

// Method 2: XPath (page.$x always returns an array of element handles)
// Note: page.$x() was removed in Puppeteer v22; on newer versions use
// page.$$('xpath/' + expression) instead.
const containers = await page.$x('//div[@class="container"]')
const [submitButton] = await page.$x('//button[contains(text(), "Submit")]')

// Method 3: Text-based selection (via XPath)
const [buyButton] = await page.$x('//*[contains(text(), "Buy Now")]')

// Method 4: Attribute selectors
const usernameInput = await page.$('[name="username"]')
const downloadLink = await page.$('a[href*="download"]')

// Check existence and retrieve attributes (page.$ resolves to null when nothing matches)
if (element) {
  const text = await page.evaluate((el) => el.textContent, element)
  const html = await page.evaluate((el) => el.outerHTML, element)
  const className = await page.evaluate((el) => el.className, element)
  const href = await page.evaluate((el) => el.href, element)
  const isVisible = await element.isVisible()
}

// Wait for elements
// Note: page.waitForXPath() was also removed in Puppeteer v22; on newer
// versions use page.waitForSelector('xpath/' + expression) instead.
await page.waitForSelector('.product-title', { timeout: 10000 })
await page.waitForXPath('//div[@class="container"]')

// Retrieve element handle via evaluateHandle
const elementHandle = await page.evaluateHandle(() => {
  return document.querySelector('.product-title')
})

Batch Element Processing
// Retrieve all matching elements
const productItems = await page.$$('.product-item')
console.log(`Found ${productItems.length} products`)

// Method 1: Iterative processing (recommended)
const productsData = []
for (const item of productItems) {
  const nameElem = await item.$('.name')
  const priceElem = await item.$('.price')
  const linkElem = await item.$('.link')

  const product = {
    name: nameElem
      ? await page.evaluate((el) => el.textContent.trim(), nameElem)
      : '',
    price: priceElem
      ? await page.evaluate((el) => el.textContent.trim(), priceElem)
      : '',
    link: linkElem ? await page.evaluate((el) => el.href, linkElem) : '',
  }
  productsData.push(product)
}

// Method 2: Batch processing with evaluate (most efficient: one round trip to the browser)
const productsDataEvaluated = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-item')
  return Array.from(items).map((item) => {
    const nameElem = item.querySelector('.name')
    const priceElem = item.querySelector('.price')
    const linkElem = item.querySelector('.link')
    return {
      name: nameElem ? nameElem.textContent.trim() : '',
      price: priceElem ? priceElem.textContent.trim() : '',
      link: linkElem ? linkElem.href : '',
    }
  })
})

// Method 3: Using map + Promise.all
// $eval rejects when the selector matches nothing, so each call falls back to ''
const productsDataParallel = await Promise.all(
  productItems.map(async (item) => {
    const [name, price, link] = await Promise.all([
      item.$eval('.name', (el) => el.textContent.trim()).catch(() => ''),
      item.$eval('.price', (el) => el.textContent.trim()).catch(() => ''),
      item.$eval('.link', (el) => el.href).catch(() => ''),
    ])
    return { name, price, link }
  })
)

// Simplified $$eval usage (uniform structure)
const names = await page.$$eval('.product-item .name', (elements) =>
  elements.map((el) => el.textContent.trim())
)

Advantages:
- Operates on real browser DOM
- Direct access to JavaScript-rendered content
- Fully aligned with front-end execution logic
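Because `page.evaluate` serializes the function it receives and runs it inside the page, the extraction logic can be written as a standalone function and checked locally before it touches a browser. The following is a minimal sketch under that assumption: `extractProducts` and the stub objects are hypothetical names for illustration, and the selectors are the ones used in the batch examples above.

```javascript
// Hypothetical extractor. It must be self-contained (no outer variables),
// because Puppeteer sends the function source to the page and runs it there.
function extractProducts(root = document) {
  const text = (el) => (el ? el.textContent.trim() : '')
  return Array.from(root.querySelectorAll('.product-item')).map((item) => {
    const link = item.querySelector('.link')
    return {
      name: text(item.querySelector('.name')),
      price: text(item.querySelector('.price')),
      link: link ? link.href : '',
    }
  })
}

// In the scraper (requires a connected `page`):
//   const productsData = await page.evaluate(extractProducts)

// For a quick local check, plain objects stand in for the document.
const stubItem = (fields) => ({ querySelector: (sel) => fields[sel] ?? null })
const stubRoot = {
  querySelectorAll: () => [
    stubItem({
      '.name': { textContent: ' Widget ' },
      '.price': { textContent: '$9.99' },
      '.link': { href: 'https://example.com/w' },
    }),
    // An item with missing price and link falls back to empty strings.
    stubItem({ '.name': { textContent: 'Gadget' } }),
  ],
}
console.log(extractProducts(stubRoot))
```

The default parameter `root = document` is only evaluated when no argument is passed, so the same function runs unchanged in the page and against the stub in Node.js.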
5. Officially Discouraged Practices (Anti-Patterns)
❌ Using Fixed sleep to Wait for Page Load
await new Promise((r) => setTimeout(r, 5000))

Issues:
- Does not guarantee JavaScript execution completion
- Fails on slow pages
- Wastes time on fast pages
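The alternative to a fixed sleep is to wait for an explicit condition with a timeout. For elements in the page, `page.waitForSelector` (shown earlier) already does this. For other conditions, a small polling helper can play the same role. The sketch below is a hedged illustration only: `waitForCondition` and its option names are hypothetical and not part of Puppeteer or the platform SDK.

```javascript
// Hypothetical helper: poll `check` until it returns a truthy value,
// or fail once `timeoutMs` has elapsed.
async function waitForCondition(
  check,
  { timeoutMs = 10000, intervalMs = 100 } = {}
) {
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const value = await check()
    if (value) {
      return value // Resolves as soon as the condition holds (fast pages)
    }
    if (Date.now() >= deadline) {
      // Slow pages produce an explicit error instead of silent bad data
      throw new Error(`Condition not met within ${timeoutMs} ms`)
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
}

// Usage in a scraper (requires a connected `page`):
//   const title = await waitForCondition(() => page.$('.product-title'))
```

Unlike the fixed sleep, this returns immediately when the condition is already true and reports a clear timeout when it never becomes true.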
❌ Using requests / fetch to Simulate Page Requests
fetch(url)

Issues:
- Incomplete page content (JavaScript-rendered content is missing from the raw response)
- Easily detected by anti-bot systems
- Unpredictable success rate