Python Example
Learn how to build a Worker using Python.
GitHub Repository
Python Script Demo Repository: Python-Worker-Demo
Required Files (Project Root Directory)
```
├── main.py              # Script entry file
├── requirements.txt     # Python dependencies
├── input_schema.json    # Input form configuration
├── output_schema.json   # Output table configuration
├── sdk.py               # CoreClaw SDK - Core functionality
├── sdk_pb2.py           # Data processing module
└── sdk_pb2_grpc.py      # Network communication module
```

File Overview
| File | Description |
|---|---|
| main.py | Script entry file (execution entry point); must be named main.py |
| requirements.txt | Python dependency management file |
| input_schema.json | UI input form configuration file |
| output_schema.json | Output table structure configuration file |
| sdk.py | Core SDK functionality |
| sdk_pb2.py | Enhanced data processing module |
| sdk_pb2_grpc.py | Network communication module |
These three SDK files (sdk.py, sdk_pb2.py, sdk_pb2_grpc.py) are required and must be placed in the root directory of the project. Together they form the script’s toolbox, providing all essential capabilities for Worker execution and interaction with the platform backend.
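Before packaging a Worker, it can help to confirm that every required file is present in the project root. The following is a minimal sketch using only the standard library; `missing_required_files` is a hypothetical helper name, and the file list is copied from the table above.

```python
from pathlib import Path

# File names taken from the "Required Files" list above.
REQUIRED_FILES = (
    "main.py",
    "requirements.txt",
    "input_schema.json",
    "output_schema.json",
    "sdk.py",
    "sdk_pb2.py",
    "sdk_pb2_grpc.py",
)


def missing_required_files(project_root):
    """Return the required file names that are absent from project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_required_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from the project root; an empty result means the layout matches the list above.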
Core SDK Usage
The CoreClaw SDK (CoreSDK) provides three core capabilities that every Worker needs:
1. Parameter Retrieval — Get Input Configuration
When a Worker starts, the platform passes input parameters (such as URLs, keywords, etc.). Use the following method to retrieve them:

```python
from sdk import CoreSDK

# Get all input parameters as a dictionary
input_json_dict = CoreSDK.Parameter.get_input_json_dict()

# Example: retrieve a specific parameter
url = input_json_dict.get('url')
```

Use case: Pass different parameters for different tasks without modifying code.
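Because the input arrives as a plain dictionary, a Worker may want to validate it and apply defaults in one place before the business logic runs. The sketch below assumes `url` is required and adds a hypothetical optional parameter `max_pages`; neither rule comes from the platform, and `read_params` is an illustrative helper, not part of the SDK.

```python
def read_params(input_json_dict):
    """Validate and normalize the dictionary returned by get_input_json_dict()."""
    # Treat a missing payload as an empty dictionary.
    params = dict(input_json_dict or {})

    # "url" appears in the example above; this sketch treats it as required.
    url = str(params.get("url") or "").strip()
    if not url:
        raise ValueError("Input parameter 'url' is required")

    # "max_pages" is a hypothetical optional parameter with a default of 1.
    max_pages = int(params.get("max_pages", 1))
    if max_pages < 1:
        raise ValueError("'max_pages' must be at least 1")

    return {"url": url, "max_pages": max_pages}
```

In a Worker, the argument would be the result of `CoreSDK.Parameter.get_input_json_dict()`.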
2. Logging — Record Execution Progress
Record different levels of log messages during execution. These logs appear in the Console, making it easy to monitor status and debug issues:

```python
# Debug info (most detailed, for troubleshooting)
CoreSDK.Log.debug("Connecting to target website...")

# General info (normal process recording)
CoreSDK.Log.info("Successfully retrieved 10 data items")

# Warning (notable but non-error situations)
CoreSDK.Log.warn("Slow network connection, may affect speed")

# Error (execution failures)
CoreSDK.Log.error("Cannot access target website")
```

Log levels:
- debug — Most detailed, ideal for development
- info — Normal process recording, recommended for key steps
- warn — Warning, indicates potential issues
- error — Error, requires attention
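One way to apply these levels consistently is to wrap each key step so that its start and duration are logged at info level and any failure at error level. The sketch below is an illustration only: `logged_step` is a hypothetical helper, and the logging functions are passed in as parameters so the code runs without the SDK.

```python
import time
from contextlib import contextmanager


@contextmanager
def logged_step(name, info, error):
    """Log the start, duration, and any failure of one step."""
    start = time.monotonic()
    info(f"{name}: started")
    try:
        yield
    except Exception as exc:
        # Report the failure, then let the exception propagate.
        error(f"{name}: failed ({exc})")
        raise
    else:
        info(f"{name}: finished in {time.monotonic() - start:.2f}s")
```

In a Worker this could be used as `with logged_step("fetch", CoreSDK.Log.info, CoreSDK.Log.error): ...`.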
3. Result Output — Push Data Back to Platform
After collecting data, push it back to the platform in two steps: set the table headers, then push the rows. Step 3 covers upsert, an alternative to a plain push for records that may already exist.
Step 1: Set Table Headers
Define the table structure before pushing data, similar to defining column headers in a spreadsheet:

```python
headers = [
    {"label": "Title", "key": "title", "format": "text"},
    {"label": "URL", "key": "url", "format": "text"},
    {"label": "Category", "key": "category", "format": "text"},
]
CoreSDK.Result.set_table_header(headers)
```

Field descriptions:
- label: Column title displayed to users
- key: Unique identifier used in code (must match the keys passed to push_data)
- format: Data type, one of "text", "integer", "boolean", "array", "object"
Step 2: Push Data Row by Row
Push each collected data item individually:

```python
for item in collected_data:
    obj = {
        "title": item.get("title"),
        "url": item.get("url"),
        "category": item.get("category"),
    }
    CoreSDK.Result.push_data(obj)
```

Important:
- Set table headers before pushing data
- Keys in push_data must match keys in table headers exactly
- Data must be pushed one row at a time
- Add logging after each push to track progress
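The rules above can be enforced with a few small helpers. This is a sketch under stated assumptions, not SDK functionality: `validate_headers`, `row_for_headers`, and `push_rows` are hypothetical names, the allowed formats are the five listed in the field descriptions, and the push and log functions are passed in as parameters so the code runs without the SDK.

```python
ALLOWED_FORMATS = {"text", "integer", "boolean", "array", "object"}


def validate_headers(headers):
    """Return a list of problems found in a table header definition."""
    problems, seen_keys = [], set()
    for index, header in enumerate(headers):
        missing = [field for field in ("label", "key", "format") if field not in header]
        if missing:
            problems.append(f"header {index}: missing {', '.join(missing)}")
            continue
        if header["format"] not in ALLOWED_FORMATS:
            problems.append(f"header {index}: unknown format {header['format']!r}")
        if header["key"] in seen_keys:
            problems.append(f"header {index}: duplicate key {header['key']!r}")
        seen_keys.add(header["key"])
    return problems


def row_for_headers(item, headers):
    """Build a row whose keys are exactly the header keys, in header order."""
    return {header["key"]: item.get(header["key"]) for header in headers}


def push_rows(items, headers, push, log):
    """Check the headers, then push one row per item and log each push."""
    problems = validate_headers(headers)
    if problems:
        raise ValueError("; ".join(problems))
    total = len(items)
    for count, item in enumerate(items, start=1):
        push(row_for_headers(item, headers))
        log(f"Pushed row {count} of {total}")
    return total
```

In a Worker, `push` would be `CoreSDK.Result.push_data` and `log` would be `CoreSDK.Log.info`, called after `CoreSDK.Result.set_table_header(headers)`. Fields missing from an item are sent as `None`, and extra fields are dropped, so every row carries exactly the header keys.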
Step 3: Upsert Data (Update or Insert)
Use upsert_data to update existing records or insert new ones based on a unique key. This is useful when you need to re-scrape and update previously collected data:

```python
data = {
    "id": "test-1",
    "title": "Updated Title",
    "description": "Updated description",
}
CoreSDK.Result.upsert_data(data, "id")
```

How it works:
- If a record with the same unique key exists, it will be updated
- If no matching record is found, a new record will be inserted
- The unique key must exist in the data dictionary
- Important: The unique key field must also be defined in output_schema.json, or the platform cannot match and update rows correctly
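The update-or-insert behaviour described above can be modelled locally with a dictionary keyed by the unique key. This is only an illustration of the semantics, not the platform's implementation: `upsert` is a hypothetical function, `table` stands in for the result table, and merging new fields into an existing row (rather than replacing the row) is an assumption.

```python
def upsert(table, record, unique_key):
    """Update the row matching record[unique_key], or insert a new one.

    `table` is a plain dict mapping unique-key values to rows. It stands in
    for the platform's result table and is purely illustrative.
    """
    if unique_key not in record:
        raise KeyError(f"unique key {unique_key!r} is missing from the record")
    key_value = record[unique_key]
    if key_value in table:
        # Existing row: merge the new values into it (assumed behaviour).
        table[key_value].update(record)
        return "updated"
    # No match: insert a copy of the record.
    table[key_value] = dict(record)
    return "inserted"
```

Calling it twice with the same `id` inserts on the first call and updates on the second, which mirrors the re-scrape scenario mentioned above.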
Script Entry File (main.py)
Synchronous vs Asynchronous
CoreClaw supports both synchronous and asynchronous styles for Python Workers. Choose the one that best fits your needs:
| Style | Entry Point | Best For |
|---|---|---|
| Synchronous | def main(): | Simple scripts, sequential execution |
| Asynchronous | async def run(): + asyncio.run(run()) | Concurrent I/O, async libraries (aiohttp, etc.) |
Synchronous example (recommended for beginners):
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
from sdk import CoreSDK


def main():
    try:
        # 1. Get input parameters
        input_json_dict = CoreSDK.Parameter.get_input_json_dict()
        CoreSDK.Log.debug(f"Input parameters: {input_json_dict}")

        # 2. Proxy configuration (read from environment variables)
        proxy_auth = os.environ.get("PROXY_AUTH")
        # Log only whether the proxy is configured, to keep credentials out of the logs
        CoreSDK.Log.info(f"Proxy auth configured: {bool(proxy_auth)}")

        # 3. Business logic
        url = input_json_dict.get('url')
        CoreSDK.Log.info(f"Processing URL: {url}")

        result = {
            "url": url,
            "status": "success",
        }

        # 4. Set table headers
        headers = [
            {"label": "URL", "key": "url", "format": "text"},
            {"label": "Status", "key": "status", "format": "text"},
        ]
        CoreSDK.Result.set_table_header(headers)

        # 5. Push result data
        CoreSDK.Result.push_data(result)

        CoreSDK.Log.info("Script execution completed")

    except Exception as e:
        CoreSDK.Log.error(f"Execution error: {e}")
        CoreSDK.Result.push_data({
            "error": str(e),
            "error_code": "500",
            "status": "failed"
        })
        raise


if __name__ == "__main__":
    main()
```

Asynchronous example (for advanced use cases):
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import asyncio
import os
from sdk import CoreSDK


async def run():
    try:
        # 1. Get input parameters
        input_json_dict = CoreSDK.Parameter.get_input_json_dict()
        CoreSDK.Log.debug(f"Input parameters: {input_json_dict}")

        # 2. Proxy configuration (read from environment variables)
        proxy_auth = os.environ.get("PROXY_AUTH")
        # Log only whether the proxy is configured, to keep credentials out of the logs
        CoreSDK.Log.info(f"Proxy auth configured: {bool(proxy_auth)}")

        # 3. Business logic
        url = input_json_dict.get('url')
        CoreSDK.Log.info(f"Processing URL: {url}")

        result = {
            "url": url,
            "status": "success",
        }

        # 4. Set table headers
        headers = [
            {"label": "URL", "key": "url", "format": "text"},
            {"label": "Status", "key": "status", "format": "text"},
        ]
        CoreSDK.Result.set_table_header(headers)

        # 5. Push result data
        CoreSDK.Result.push_data(result)

        CoreSDK.Log.info("Script execution completed")

    except Exception as e:
        CoreSDK.Log.error(f"Execution error: {e}")
        CoreSDK.Result.push_data({
            "error": str(e),
            "error_code": "500",
            "status": "failed"
        })
        raise


if __name__ == "__main__":
    asyncio.run(run())
```

How It Works
The script follows four stages:
- Receive instructions: Get input parameters (URLs, keywords, etc.) from the platform
- Network setup: Configure the proxy via the PROXY_AUTH environment variable for accessing external websites
- Execute task: Run the core scraping logic on target pages
- Report results: Set table headers first, then push collected data back to the platform
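The network setup stage might look like the sketch below. It rests on two assumptions that the text above does not confirm: that PROXY_AUTH holds credentials in `username:password` form, and that the proxy is reached at a host you supply yourself (`proxy.example.com:8080` is a placeholder, not a CoreClaw endpoint). `build_proxies` is a hypothetical helper that returns the `{"http": ..., "https": ...}` mapping accepted by the `proxies` argument of HTTP clients such as requests.

```python
import os


def build_proxies(proxy_auth, proxy_host):
    """Return a requests-style proxies mapping, or None when no proxy is set."""
    if not proxy_auth:
        return None
    # Assumes proxy_auth is "username:password" with no characters that need URL encoding.
    proxy_url = f"http://{proxy_auth}@{proxy_host}"
    return {"http": proxy_url, "https": proxy_url}


# "proxy.example.com:8080" is a placeholder host, not a real CoreClaw endpoint.
proxies = build_proxies(os.environ.get("PROXY_AUTH"), "proxy.example.com:8080")
```

When PROXY_AUTH is unset, `proxies` is `None`, so the same code can run locally without a proxy.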
Python Dependency Management (requirements.txt)
This file lists all third-party Python packages required to run the script. The platform automatically installs all dependencies specified in this file.
Example
```
aiofiles==25.1.0
certifi==2025.11.12
cffi==2.0.0
cssselect==1.3.0
curl_cffi==0.13.0
grpcio==1.80.0
protobuf==6.31.1
python-dateutil
tenacity
```

Important Notes
Versioning
- Packages with versions (e.g. beautifulsoup4==4.14.2) will be installed exactly as specified
- Packages without versions will install the latest available version
Installation
- Dependencies are installed automatically by the platform
- Installation time depends on network speed and package size
- Errors will be displayed if installation fails
Ensuring Proper Execution
- grpcio and protobuf must be included (required by the SDK)
- protobuf version must match the one used to generate sdk_pb2.py (check the sdk_pb2.py header for the exact version)
- All third-party libraries must be listed
- Core dependencies should use fixed versions for stability
- Update dependencies regularly for security and bug fixes
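Two of the points above, the presence of grpcio and protobuf and the pinning of versions, can be checked mechanically before upload. The sketch below is a rough check for simple `name==version` style lines only; it does not handle options such as `-r`, URLs, or other advanced requirement syntax, and `check_requirements` is a hypothetical helper.

```python
import re

# Packages the list above says the SDK requires.
SDK_PACKAGES = ("grpcio", "protobuf")


def check_requirements(text):
    """Return (unpinned, missing_sdk) for the contents of a requirements.txt."""
    pinned, unpinned = set(), []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # The name is everything before the first version operator,
        # extras bracket, environment marker, or space.
        name = re.split(r"[=<>!~;\[ ]", line, maxsplit=1)[0].strip().lower()
        if "==" in line:
            pinned.add(name)
        else:
            unpinned.append(name)
    listed = pinned | set(unpinned)
    missing_sdk = [pkg for pkg in SDK_PACKAGES if pkg not in listed]
    return unpinned, missing_sdk
```

Applied to the example file above, this would report python-dateutil and tenacity as unpinned and no missing SDK packages.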
Q: Why specify versions?
A: To ensure consistent behavior across development, testing, and production environments.
Q: What if I don’t specify a version?
A: The latest version will be installed, which may cause compatibility issues. For core dependencies, pinning versions is recommended.
Q: How do I add new dependencies?
A: Add a new line to requirements.txt and re-upload the ZIP package. The platform will install them on the next run.
Q: What if installation fails?
A: Check network connectivity or package mirrors. If the issue persists, verify the package name and version.
Q: Can I use both sync and async code in the same Worker?
A: Yes. CoreClaw supports both styles. Choose the one that best fits your use case. Async is recommended when using async libraries like aiohttp or when you need concurrent I/O.
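To illustrate the concurrent I/O case mentioned in the last answer, the sketch below fetches several URLs at once while capping the number of requests in flight. It uses only the standard library: `fetch` is a stand-in that sleeps instead of making a real request, and `fetch_all` and the `limit` parameter are hypothetical names chosen for this example.

```python
import asyncio


async def fetch(url):
    """Stand-in for a real request (for example via aiohttp)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"url": url, "status": "success"}


async def fetch_all(urls, limit=5):
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

Inside an asynchronous Worker, `await fetch_all(urls)` would replace the business-logic step, and each result could then be passed to `CoreSDK.Result.push_data`.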