AskPablos Scrapy API
A professional Scrapy integration for seamlessly routing requests through the AskPablos Proxy API, with support for headless browser rendering and JavaScript strategies.
Key Features
🔄 Selective Proxying: Only routes requests with askpablos_api_map in their meta
🌐 Headless Browser Support: Render JavaScript-heavy pages
🔄 Rotating Proxies: Access to a pool of rotating IP addresses
🧠 JavaScript Rendering: Render JavaScript-heavy pages
📸 Screenshot Capture: Take screenshots
🎯 SPA Operations: Advanced browser operations for interacting with Single Page Applications
🔒 Secure Authentication: HMAC-SHA256 request signing
🔁 Automatic Retries: Configurable retry logic
⚠️ Comprehensive Error Handling: Detailed logging and error reporting
Requirements
Python 3.9+
Scrapy 2.6+
Valid AskPablos Proxy API credentials
Installation
pip install askpablos-scrapy-api
Or install directly from GitHub:
pip install git+https://github.com/fawadss1/askpablos-scrapy-api.git
Quick Start
1. Configure Settings
Add to your settings.py:
# Required settings
API_KEY = "your_api_key"  # Your AskPablos API key
SECRET_KEY = "your_secret_key"  # Your AskPablos secret key

# Optional global settings
TIMEOUT = 30  # Request timeout in seconds
MAX_RETRIES = 2  # Maximum number of retries

# Add the middleware
DOWNLOADER_MIDDLEWARES = {
    'askpablos_scrapy_api.middleware.AskPablosAPIDownloaderMiddleware': 585,
}
2. Use in Your Spider
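import scrapy


class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            # Only requests carrying askpablos_api_map go through the API
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

    def parse(self, response):
        # Minimal callback: extract the page title from the rendered HTML
        yield {'title': response.css('title::text').get()}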
Configuration Options
Meta Configuration
| Option | Type | Description |
|---|---|---|
| browser | bool | Use headless browser rendering |
| screenshot | bool | Take a screenshot of the page (requires browser: True) |
| operations | list | Browser operations for SPA interaction (requires browser: True) |
Important Note: The options screenshot and operations only work when browser: True is set.
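The dependency described in the note can be checked before a request is built. The following is a minimal sketch, not part of the package: `validate_api_map` is a hypothetical helper name, and the check only encodes the rule stated above (screenshot and operations need browser: True).

```python
def validate_api_map(config):
    """Raise ValueError if a browser-only option is set without browser: True.

    Hypothetical helper for illustration; it is not provided by
    askpablos-scrapy-api.
    """
    browser_only = ("screenshot", "operations")
    if not config.get("browser"):
        # Collect every browser-only option that is switched on
        used = [key for key in browser_only if config.get(key)]
        if used:
            raise ValueError(
                ", ".join(used) + " requires 'browser': True in askpablos_api_map"
            )
    return config
```

Calling it on the dictionary before passing it as `meta={"askpablos_api_map": ...}` surfaces a misconfiguration at spider start rather than after a request has been sent.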
Environment Variables
Instead of putting sensitive API keys in your settings file, you can use environment variables:
# Set these environment variables before running your spider
export ASKPABLOS_API_KEY="your_api_key"
export ASKPABLOS_SECRET_KEY="your_secret_key"
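This README does not say whether the middleware reads these variables on its own. If you prefer to wire them up explicitly, a sketch along the following lines could sit in settings.py. The helper name `load_askpablos_credentials` is hypothetical; only the variable names and the API_KEY and SECRET_KEY settings come from this document.

```python
import os


def load_askpablos_credentials(environ=os.environ):
    """Return (api_key, secret_key) read from the environment.

    Hypothetical helper for illustration. Fails early with a clear
    message when either variable is missing or empty.
    """
    names = ("ASKPABLOS_API_KEY", "ASKPABLOS_SECRET_KEY")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return environ["ASKPABLOS_API_KEY"], environ["ASKPABLOS_SECRET_KEY"]


# In settings.py (assumption: the settings names match the Quick Start):
# API_KEY, SECRET_KEY = load_askpablos_credentials()
```

Failing at import time of settings.py keeps a spider from starting with empty credentials.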
Documentation
Detailed Usage Guide - Complete instructions for configuring and using AskPablos Scrapy API
FAQ - Answers to common questions and troubleshooting help
How It Works
AskPablos Scrapy API intercepts requests that carry an askpablos_api_map entry in their meta dictionary and routes them through the AskPablos proxy service. The service processes each request with a headless browser and/or rotating proxies, as specified, and then returns the HTML response.
Requests without the askpablos_api_map configuration bypass the processing entirely, giving you full control over which requests use the proxy service.
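The opt-in rule can be pictured as a simple check on the request's meta dictionary. This is a sketch of the idea only, not the middleware's actual code: `uses_askpablos` and `split_by_routing` are hypothetical names, and treating an empty askpablos_api_map as opted in is an assumption.

```python
API_MAP_KEY = "askpablos_api_map"


def uses_askpablos(meta):
    """Return True when the meta dictionary opts in to the proxy service.

    Assumption: the mere presence of the key counts as opting in,
    even if the mapping is empty.
    """
    return API_MAP_KEY in meta


def split_by_routing(metas):
    """Partition meta dictionaries into (proxied, direct) lists."""
    proxied = [meta for meta in metas if uses_askpablos(meta)]
    direct = [meta for meta in metas if not uses_askpablos(meta)]
    return proxied, direct
```

In a spider this means two requests to the same site can take different paths: one with `meta={"askpablos_api_map": {"browser": True}}` goes through the service, while one with no such entry is downloaded by Scrapy as usual.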
Advanced Configuration
All Available Options
# Request with all available options
yield scrapy.Request(
    url="https://example.com",
    callback=self.parse,
    meta={
        'askpablos_api_map': {
            'browser': True,  # Use headless browser
            'screenshot': True,  # Take screenshot
        }
    }
)
SPA Interaction with Operations
For Single Page Applications that require waiting for specific elements:
import scrapy


class MySPASpider(scrapy.Spider):
    name = 'spa_example'

    def start_requests(self):
        yield scrapy.Request(
            url='https://spa-example.com',
            meta={
                "askpablos_api_map": {
                    "browser": True,
                    "operations": [
                        {
                            "task": "waitForElement",
                            "match": {
                                "on": "xpath",  # or "css"
                                "rule": "visible",  # or "attached", "hidden", "detached"
                                "value": "//*[@id='content-loaded']"
                            },
                            "maxWait": 10,
                            "onFailure": "return"  # or "continue", "throw"
                        }
                    ]
                }
            },
            callback=self.parse
        )

    def parse(self, response):
        # Element has loaded before parsing
        data = response.css('.dynamic-content::text').getall()
        yield {'data': data}
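When a spider needs several waitForElement operations, building the dictionaries by hand gets repetitive. The sketch below is a hypothetical convenience function, not part of the package. It uses only the keys and allowed values shown in the example above, and the defaults simply mirror that example.

```python
ALLOWED_ON = {"xpath", "css"}
ALLOWED_RULES = {"visible", "attached", "hidden", "detached"}
ALLOWED_ON_FAILURE = {"return", "continue", "throw"}


def wait_for_element(value, on="xpath", rule="visible", max_wait=10,
                     on_failure="return"):
    """Build one waitForElement operation dictionary.

    Hypothetical helper; the allowed values are taken from the comments
    in the spider example, not from the service's full specification.
    """
    if on not in ALLOWED_ON:
        raise ValueError("on must be one of " + ", ".join(sorted(ALLOWED_ON)))
    if rule not in ALLOWED_RULES:
        raise ValueError("rule must be one of " + ", ".join(sorted(ALLOWED_RULES)))
    if on_failure not in ALLOWED_ON_FAILURE:
        raise ValueError(
            "on_failure must be one of " + ", ".join(sorted(ALLOWED_ON_FAILURE))
        )
    return {
        "task": "waitForElement",
        "match": {"on": on, "rule": rule, "value": value},
        "maxWait": max_wait,
        "onFailure": on_failure,
    }
```

The result can be placed directly in the list, for example `"operations": [wait_for_element("//*[@id='content-loaded']")]`, which produces the same operation as the hand-written dictionary in the spider.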
See the Detailed Usage Guide for more examples and complete documentation.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.