AskPablos Scrapy API - Usage Guide
This guide walks you through how to configure and use AskPablosScrapyAPI in your Scrapy project.
Table of Contents
Installation
AskPablos Scrapy API requires Python 3.9 or higher and Scrapy 2.6.0 or higher.
Install the package using pip:
pip install askpablos-scrapy-api
Or directly from the repository:
pip install git+https://github.com/fawadss1/askpablos-scrapy-api.git
Configuration
Global Settings (settings.py)
Configure the middleware globally in your project’s settings.py file:
# Required settings
API_KEY = "your_api_key" # Your AskPablos API key
SECRET_KEY = "your_secret_key" # Your AskPablos secret key
# Optional global settings
APCLOUDY_URL = "https://domain.com" # Base URL for AskPablos API (optional)
TIMEOUT = 30 # Request timeout in seconds
MAX_RETRIES = 2 # Maximum number of retries for failed requests
# Add the middleware
DOWNLOADER_MIDDLEWARES = {
'askpablos_scrapy_api.middleware.AskPablosAPIDownloaderMiddleware': 585,
}
Per-Request Configuration
Configure individual requests using the askpablos_api_map in request meta:
meta = {
"askpablos_api_map": {
"browser": True, # Use headless browser
"screenshot": True, # Take screenshot (requires browser: True)
"operations": [...], # Browser operations for SPA interaction (requires browser: True)
"geoLocation": "US", # Target country (2-letter ISO code, e.g. "PK", "US", "GB")
"proxyType": "residential" # Proxy type: "datacenter", "residential", or "mobile"
}
}
Basic Usage
Simple GET Request with Browser Rendering
import scrapy
class MySpider(scrapy.Spider):
name = 'example'
def start_requests(self):
yield scrapy.Request(
url='https://example.com',
meta={
"askpablos_api_map": {
"browser": True
}
},
callback=self.parse
)
def parse(self, response):
# Process the response normally
for item in response.css('.item'):
yield {
'title': item.css('h2::text').get(),
'description': item.css('p::text').get()
}
POST Request Support
import scrapy
import json
class MySpider(scrapy.Spider):
name = 'example'
def start_requests(self):
# Using FormRequest for POST requests
yield scrapy.FormRequest(
url='https://api.example.com/endpoint',
formdata={'key': 'value'},
meta={
"askpablos_api_map": {
"browser": True
}
},
callback=self.parse
)
# Or using Request with method='POST' and JSON body
yield scrapy.Request(
url='https://api.example.com/endpoint',
method='POST',
body=json.dumps({'key': 'value'}),
headers={'Content-Type': 'application/json'},
meta={
"askpablos_api_map": {}
},
callback=self.parse
)
def parse(self, response):
# Process the response
data = response.json()
yield {'result': data}
Advanced Usage
Geo-Targeted Requests
Route requests through a proxy in a specific country using the geoLocation option:
def start_requests(self):
yield scrapy.Request(
url='https://example.com',
meta={
"askpablos_api_map": {
"geoLocation": "US" # 2-letter ISO country code
}
},
callback=self.parse
)
Supported values are standard ISO 3166-1 alpha-2 country codes such as "US", "PK", "GB", "DE", "FR", etc. The value is case-insensitive and will be normalized to uppercase internally.
Choosing Proxy Type
Control the type of proxy used with the proxyType option:
def start_requests(self):
yield scrapy.Request(
url='https://example.com',
meta={
"askpablos_api_map": {
"proxyType": "residential" # "datacenter", "residential", or "mobile"
}
},
callback=self.parse
)
Value |
Description |
|---|---|
|
Fast, cost-efficient proxies hosted in data centers |
|
IPs assigned by ISPs to real home users — high trust, harder to detect |
|
IPs from mobile carriers — highest trust for mobile-targeted sites |
Both options can be combined freely with browser, screenshot, operations, etc.:
meta = {
"askpablos_api_map": {
"browser": True,
"screenshot": True,
"geoLocation": "GB",
"proxyType": "mobile"
}
}
Screenshot Capture
def start_requests(self):
yield scrapy.Request(
url='https://example.com',
meta={
"askpablos_api_map": {
"browser": True,
"screenshot": True
}
},
callback=self.parse_with_screenshot
)
def parse_with_screenshot(self, response):
# Access screenshot data
screenshot = response.meta.get('screenshot')
if screenshot:
with open('page_screenshot.png', 'wb') as f:
f.write(screenshot)
SPA (Single Page Application) Handling
Basic SPA Support
meta = {
"askpablos_api_map": {
"browser": True
}
}
Advanced SPA Interaction with Operations
For more complex SPAs that require waiting for specific elements or performing actions:
def start_requests(self):
yield scrapy.Request(
url='https://spa-example.com',
meta={
"askpablos_api_map": {
"browser": True,
"operations": [
{
"task": "waitForElement",
"match": {
"on": "xpath",
"rule": "visible",
"value": "//*[@id='content-loaded']"
},
"maxWait": 10,
"onFailure": "return"
}
]
}
},
callback=self.parse_spa
)
def parse_spa(self, response):
# The page has waited for the element to be visible
data = response.css('.dynamic-content::text').getall()
yield {'data': data}
Operations Parameters:
task: Action to perform
waitForElement- Wait for element to match condition
match: Element matching criteria
on:"xpath"or"css"- Selector typerule: Element state to wait for"visible"- Element is visible on page"attached"- Element exists in DOM"hidden"- Element exists but is hidden"detached"- Element is removed from DOM
value: Selector string (XPath or CSS selector)
maxWait (optional): Maximum seconds to wait (must be > 0)
onFailure (optional): Action when operation fails
"continue"- Ignore failure and continue"return"- Stop operations and return page"throw"- Raise an error
Multiple Operations Example
You can chain multiple operations:
meta = {
"askpablos_api_map": {
"browser": True,
"operations": [
{
"task": "waitForElement",
"match": {
"on": "css",
"rule": "visible",
"value": "#login-form"
},
"maxWait": 5,
"onFailure": "throw"
},
{
"task": "waitForElement",
"match": {
"on": "xpath",
"rule": "visible",
"value": "//div[@class='content-loaded']"
},
"maxWait": 15,
"onFailure": "return"
}
]
}
}
Configuration Options
Meta Configuration Options
Option |
Type |
Required |
Description |
|---|---|---|---|
|
bool |
No |
Use headless browser rendering |
|
bool |
No |
Take screenshot of the page (requires |
|
list |
No |
Browser operations for SPA interaction (requires |
|
str |
No |
2-letter ISO country code for geo-targeting (e.g. |
|
str |
No |
Proxy type: |
Important Note: The options screenshot and operations only work when browser: True is set. If browser rendering is disabled, these options will be ignored.
Settings.py Configuration
Setting |
Type |
Default |
Description |
|---|---|---|---|
|
str |
Required |
Your AskPablos API key |
|
str |
Required |
Your AskPablos secret key |
|
str |
https://appcloudy.askpablos.com |
Base URL for AskPablos API |
|
int |
30 |
Request timeout in seconds |
|
int |
2 |
Maximum retry attempts |
Best Practices
Configure timeouts appropriately:
Set reasonable
TIMEOUTvalues in settings.pyConsider page complexity when setting timeouts
Use screenshots for debugging:
Enable screenshots when troubleshooting
Disable in production unless necessary
Optimize retry settings:
Configure
MAX_RETRIESglobally in settings.py
Troubleshooting
Common Issues
Authentication Errors: Verify your API_KEY and SECRET_KEY
Timeout Issues: Increase TIMEOUT in settings.py
Rate Limiting: Reduce concurrent requests in your spider