Puppeteer

Why Puppeteer is selected as the page scraping framework in modern Web data collection scenarios within the Node.js ecosystem, and the platform’s officially recommended usage patterns.

1. Positioning and Role of Puppeteer

Puppeteer is a Node.js-based browser automation framework that directly controls Chromium browsers via the DevTools Protocol (CDP).

Its core capabilities include:

Real Chromium browser control
JavaScript execution and full page rendering
DOM manipulation and event simulation
Native WebSocket (CDP) support for connecting to remote browsers
Well suited for page scraping and automation tasks in the Node.js ecosystem

Puppeteer does not simulate browser requests.

It directly drives a real browser to execute page logic.

2. Officially Recommended Implementation

1️⃣ Connecting to a Remote Fingerprint Browser

let auth = null
try {
    auth = process.env.PROXY_AUTH || null
    await coresdk.log.info(`Browser authentication info: ${auth}`)
} catch (err) {
    await coresdk.log.error(
        `Failed to obtain browser authentication info: ${err.message}`
    )
    auth = null
}

// Fingerprint browser endpoint (read from environment variable for flexible deployment)
const chromeWs = process.env.ChromeWs || 'chrome-ws-inner.coreclaw.com'
await coresdk.log.info(`Chrome WebSocket endpoint: ${chromeWs}`)

let browser_url = `ws://${auth}@${chromeWs}`
await coresdk.log.info(`Fingerprint browser endpoint: ${browser_url}`)

url = inputJson?.url
await coresdk.log.info(`Processing URL: ${url}`)

let browser = await puppeteer.connect({
    browserWSEndpoint: browser_url,
    defaultViewport: null, // Disable Puppeteer's default viewport
})

const page = await browser.newPage()
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 })
const html = await page.content()

let result = {
    url: url,
    html: html,
    resp_status: '200',
}

3. Complete Platform Script Entry Example (Recommended)

#!/usr/bin/env node
'use strict'

const coresdk = require('./sdk')
const puppeteer = require('puppeteer-core')

async function run() {
    let url = ''
    try {
        // 1. Define table headers
        const headers = [
            { label: 'url', key: 'url', format: 'text' },
            { label: 'html', key: 'html', format: 'text' },
            { label: 'resp_status', key: 'resp_status', format: 'text' },
        ]

        await coresdk.result.setTableHeader(headers)

        // 2. Retrieve input parameters
        const inputJson = await coresdk.parameter.getInputJSONObject()
        await coresdk.log.debug(
            `Input parameters: ${JSON.stringify(inputJson)}`
        )

        // 3. Obtain fingerprint browser authentication
        let auth = null
        try {
            auth = process.env.PROXY_AUTH || null
            await coresdk.log.info(`Browser authentication info: ${auth}`)
        } catch (err) {
            await coresdk.log.error(
                `Failed to obtain browser authentication info: ${err.message}`
            )
            auth = null
        }

        // Fingerprint browser endpoint (read from environment variable for flexible deployment)
        const chromeWs = process.env.ChromeWs || 'chrome-ws-inner.coreclaw.com'
        await coresdk.log.info(`Chrome WebSocket endpoint: ${chromeWs}`)

        let browser_url = `ws://${auth}@${chromeWs}`
        await coresdk.log.info(`Fingerprint browser endpoint: ${browser_url}`)

        // 4. Business logic
        url = inputJson?.url
        await coresdk.log.info(`Processing URL: ${url}`)

        let browser = await puppeteer.connect({
            browserWSEndpoint: browser_url,
            defaultViewport: null,
        })

        const page = await browser.newPage()
        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 })
        const html = await page.content()

        let result = {
            url: url,
            html: html,
            resp_status: '200',
        }

        // 5. Push results to the platform
        await coresdk.result.pushData(result)

        await coresdk.log.info('Script execution completed')
    } catch (err) {
        await coresdk.log.error(`Script execution error: ${err.message}`)

        const errorResult = {
            url: url,
            html: err.message,
            resp_status: '500',
        }

        await coresdk.result.pushData(errorResult)
        throw err
    }
}

run()

4. Dynamic Content Handling and DOM Operations

Retrieving a Single Element

// Method 1: CSS selectors (recommended)
const element = await page.$('.product-title')
const element = await page.$('#main-content')
const element = await page.$('h1')

// Method 2: XPath
const element = await page.$x('//div[@class="container"]')
const [element] = await page.$x('//button[contains(text(), "Submit")]')

// Method 3: Text-based selection (via XPath)
const [element] = await page.$x('//*[contains(text(), "Buy Now")]')

// Method 4: Other selectors
const element = await page.$('[name="username"]')
const element = await page.$('a[href*="download"]')

// Check existence and retrieve attributes
if (element) {
    const text = await page.evaluate((el) => el.textContent, element)
    const html = await page.evaluate((el) => el.outerHTML, element)
    const className = await page.evaluate((el) => el.className, element)
    const href = await page.evaluate((el) => el.href, element)
    const isVisible = await element.isVisible()
}

// Wait for elements
await page.waitForSelector('.product-title', { timeout: 10000 })
await page.waitForXPath('//div[@class="container"]')

// Retrieve element handle via evaluateHandle
const elementHandle = await page.evaluateHandle(() => {
    return document.querySelector('.product-title')
})

Batch Element Processing

// Retrieve all matching elements
const productItems = await page.$$('.product-item')
console.log(`Found ${productItems.length} products`)

// Method 1: Iterative processing (recommended)
const productsData = []
for (const item of productItems) {
    const nameElem = await item.$('.name')
    const priceElem = await item.$('.price')
    const linkElem = await item.$('.link')

    const product = {
        name: nameElem
            ? await page.evaluate((el) => el.textContent.trim(), nameElem)
            : '',
        price: priceElem
            ? await page.evaluate((el) => el.textContent.trim(), priceElem)
            : '',
        link: linkElem ? await page.evaluate((el) => el.href, linkElem) : '',
    }
    productsData.push(product)
}

// Method 2: Batch processing with evaluate (most efficient)
const productsData = await page.evaluate(() => {
    const items = document.querySelectorAll('.product-item')
    return Array.from(items).map((item) => {
        const nameElem = item.querySelector('.name')
        const priceElem = item.querySelector('.price')
        const linkElem = item.querySelector('.link')
        return {
            name: nameElem ? nameElem.textContent.trim() : '',
            price: priceElem ? priceElem.textContent.trim() : '',
            link: linkElem ? linkElem.href : '',
        }
    })
})

// Method 3: Using map + Promise.all
const productsData = await Promise.all(
    productItems.map(async (item) => {
        const [name, price, link] = await Promise.all([
            item.$eval('.name', (el) => el.textContent.trim()).catch(() => ''),
            item.$eval('.price', (el) => el.textContent.trim()).catch(() => ''),
            item.$eval('.link', (el) => el.href).catch(() => ''),
        ])
        return { name, price, link }
    })
)

// Simplified $$eval usage (uniform structure)
const names = await page.$$eval('.product-item .name', (elements) =>
    elements.map((el) => el.textContent.trim())
)

Advantages:

Operates on real browser DOM
Direct access to JavaScript-rendered content
Fully aligned with front-end execution logic

5. Officially Discouraged Practices (Anti-Patterns)

❌ Using Fixed sleep to Wait for Page Load

await new Promise((r) => setTimeout(r, 5000))

Issues:

Does not guarantee JavaScript execution completion
Fails on slow pages
Wastes time on fast pages

❌ Using requests / fetch to Simulate Page Requests

fetch(url)

Issues:

Incomplete page content
Easily detected by anti-bot systems
Unpredictable success rate

Puppeteer

1. Positioning and Role of Puppeteer

2. Officially Recommended Implementation

1️⃣ Connecting to a Remote Fingerprint Browser

2️⃣ Page Navigation and Content Retrieval

3. Complete Platform Script Entry Example (Recommended)

4. Dynamic Content Handling and DOM Operations

Retrieving a Single Element

Batch Element Processing

5. Officially Discouraged Practices (Anti-Patterns)

❌ Using Fixed sleep to Wait for Page Load

❌ Using requests / fetch to Simulate Page Requests