AskPablos Scrapy API - Usage Guide

This guide explains how to configure and use AskPablos Scrapy API in your Scrapy project.

Table of Contents

Installation
Configuration
Basic Usage
Advanced Usage
Configuration Options
Best Practices
Troubleshooting

Installation

AskPablos Scrapy API requires Python 3.9 or higher and Scrapy 2.6.0 or higher.

Install the package using pip:

pip install askpablos-scrapy-api

Or directly from the repository:

pip install git+https://github.com/fawadss1/askpablos-scrapy-api.git

Configuration

Global Settings (settings.py)

Configure the middleware globally in your project’s settings.py file:

# Required settings
API_KEY = "your_api_key"          # Your AskPablos API key
SECRET_KEY = "your_secret_key"    # Your AskPablos secret key

# Optional global settings
APCLOUDY_URL = "https://domain.com"  # Override the AskPablos API base URL (default: https://appcloudy.askpablos.com)
TIMEOUT = 30          # Request timeout in seconds
MAX_RETRIES = 2       # Maximum number of retries for failed requests

# Add the middleware
DOWNLOADER_MIDDLEWARES = {
    'askpablos_scrapy_api.middleware.AskPablosAPIDownloaderMiddleware': 585,
}
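
Hard-coding credentials in settings.py makes them easy to commit by accident. One option is to read them from environment variables instead. The sketch below is an assumption rather than part of the package: the helper `read_required_env` and the variable names `ASKPABLOS_API_KEY` and `ASKPABLOS_SECRET_KEY` are hypothetical.

```python
import os


def read_required_env(name, environ=None):
    """Return a non-empty environment value or fail fast with a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# In settings.py (variable names are hypothetical; pick your own):
# API_KEY = read_required_env("ASKPABLOS_API_KEY")
# SECRET_KEY = read_required_env("ASKPABLOS_SECRET_KEY")
```

Failing at import time means a missing key surfaces when the crawl starts, not as an authentication error on the first request.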

Per-Request Configuration

Configure individual requests using the askpablos_api_map in request meta:

meta = {
    "askpablos_api_map": {
        "browser": True,              # Use headless browser
        "screenshot": True,           # Take screenshot (requires browser: True)
        "operations": [...],          # Browser operations for SPA interaction (requires browser: True)
        "geoLocation": "US",          # Target country (2-letter ISO code, e.g. "PK", "US", "GB")
        "proxyType": "residential"    # Proxy type: "datacenter", "residential", or "mobile"
    }
}
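
The settings listed in this guide do not include project-wide defaults for these per-request options, so one way to avoid repeating them is a small helper that layers per-request overrides on top of shared defaults. This is only a sketch: `DEFAULT_API_MAP` and `make_meta` are hypothetical names, and the default values are placeholders.

```python
# Hypothetical project-wide defaults; adjust to your own needs.
DEFAULT_API_MAP = {
    "geoLocation": "US",
    "proxyType": "datacenter",
}


def make_meta(**overrides):
    """Build request meta by layering per-request options over the defaults."""
    api_map = {**DEFAULT_API_MAP, **overrides}
    return {"askpablos_api_map": api_map}


# Example: a browser-rendered request that also switches the proxy type.
meta = make_meta(browser=True, proxyType="residential")
print(meta)
```

The resulting dict can be passed as the `meta` argument of `scrapy.Request`.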

Basic Usage

Simple GET Request with Browser Rendering

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

    def parse(self, response):
        # Process the response normally
        for item in response.css('.item'):
            yield {
                'title': item.css('h2::text').get(),
                'description': item.css('p::text').get()
            }

POST Request Support

import scrapy
import json

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Using FormRequest for POST requests
        yield scrapy.FormRequest(
            url='https://api.example.com/endpoint',
            formdata={'key': 'value'},
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

        # Or using Request with method='POST' and JSON body
        yield scrapy.Request(
            url='https://api.example.com/endpoint',
            method='POST',
            body=json.dumps({'key': 'value'}),
            headers={'Content-Type': 'application/json'},
            meta={
                "askpablos_api_map": {}
            },
            callback=self.parse
        )

    def parse(self, response):
        # Process the response
        data = response.json()
        yield {'result': data}
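
`response.json()` raises an error when the body is not valid JSON, which can happen if the endpoint returns an HTML error page. A minimal defensive sketch follows. `parse_json_or_none` is a hypothetical helper that takes the response text, so inside `parse` it could be called as `parse_json_or_none(response.text)`.

```python
import json


def parse_json_or_none(text):
    """Decode a JSON body, returning None instead of raising on invalid input."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
```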

Advanced Usage

Geo-Targeted Requests

Route requests through a proxy in a specific country using the geoLocation option:

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "geoLocation": "US"   # 2-letter ISO country code
            }
        },
        callback=self.parse
    )

Supported values are standard ISO 3166-1 alpha-2 country codes such as "US", "PK", "GB", "DE", and "FR". The value is case-insensitive and will be normalized to uppercase internally.
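
Although the value is normalized internally, a spider that reads country codes from user input or a file may want to catch malformed values before sending a request. The sketch below is an optional client-side check, and `normalize_geo` is a hypothetical helper. It verifies only the two-letter shape, not whether the code is actually assigned in ISO 3166-1.

```python
def normalize_geo(code):
    """Uppercase a country code and check it has the alpha-2 shape (two ASCII letters)."""
    cleaned = code.strip().upper()
    if len(cleaned) != 2 or not cleaned.isascii() or not cleaned.isalpha():
        raise ValueError(f"Expected a 2-letter country code, got {code!r}")
    return cleaned
```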

Choosing Proxy Type

Control the type of proxy used with the proxyType option:

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "proxyType": "residential"   # "datacenter", "residential", or "mobile"
            }
        },
        callback=self.parse
    )

Value	Description
`"datacenter"`	Fast, cost-efficient proxies hosted in data centers
`"residential"`	IPs assigned by ISPs to real home users — high trust, harder to detect
`"mobile"`	IPs from mobile carriers — highest trust for mobile-targeted sites

Both options can be combined freely with browser, screenshot, operations, etc.:

meta = {
    "askpablos_api_map": {
        "browser": True,
        "screenshot": True,
        "geoLocation": "GB",
        "proxyType": "mobile"
    }
}
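
Because the proxy types in the table above range from fast and cost-efficient to higher trust, one possible strategy is to start with "datacenter" and move to the next type only when a request appears to be blocked. This escalation policy is an assumption of the sketch, not a feature of the middleware, and `PROXY_ESCALATION` and `next_proxy_type` are hypothetical names.

```python
# Order follows the table above: the fast, cost-efficient tier first,
# then the higher-trust tiers.
PROXY_ESCALATION = ["datacenter", "residential", "mobile"]


def next_proxy_type(current):
    """Return the next proxy type to try, or None when already at the last tier."""
    index = PROXY_ESCALATION.index(current)  # raises ValueError for unknown types
    if index + 1 < len(PROXY_ESCALATION):
        return PROXY_ESCALATION[index + 1]
    return None
```

A retry callback could copy the original request's `askpablos_api_map`, set `proxyType` to the returned value, and give up when the function returns None.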

Screenshot Capture

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "browser": True,
                "screenshot": True
            }
        },
        callback=self.parse_with_screenshot
    )

def parse_with_screenshot(self, response):
    # Access screenshot data
    screenshot = response.meta.get('screenshot')
    if screenshot:
        with open('page_screenshot.png', 'wb') as f:
            f.write(screenshot)
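
The callback above always writes to page_screenshot.png, so a crawl over several pages would keep overwriting the same file. One way around this is to derive the file name from the URL. The sketch below assumes, as the example above does, that the screenshot arrives as bytes. `screenshot_path` and the `screenshots` directory are hypothetical choices.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def screenshot_path(url, directory="screenshots"):
    """Map a URL to a stable, per-page PNG path inside the given directory."""
    host = urlparse(url).netloc.replace(":", "_") or "unknown"
    # A short hash of the full URL keeps pages on the same host apart.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return Path(directory) / f"{host}_{digest}.png"


# Inside the callback, after checking that a screenshot is present:
# path = screenshot_path(response.url)
# path.parent.mkdir(parents=True, exist_ok=True)
# path.write_bytes(screenshot)
```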

SPA (Single Page Application) Handling

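Basic SPA Support

For SPAs that render their content without further interaction, enabling browser rendering may be sufficient: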

meta = {
    "askpablos_api_map": {
        "browser": True
    }
}

Advanced SPA Interaction with Operations

More complex SPAs may require waiting for specific elements or performing actions before the page is returned. Use the operations option for this:

def start_requests(self):
    yield scrapy.Request(
        url='https://spa-example.com',
        meta={
            "askpablos_api_map": {
                "browser": True,
                "operations": [
                    {
                        "task": "waitForElement",
                        "match": {
                            "on": "xpath",
                            "rule": "visible",
                            "value": "//*[@id='content-loaded']"
                        },
                        "maxWait": 10,
                        "onFailure": "return"
                    }
                ]
            }
        },
        callback=self.parse_spa
    )

def parse_spa(self, response):
    # The page has waited for the element to be visible
    data = response.css('.dynamic-content::text').getall()
    yield {'data': data}

Operations Parameters:

task: Action to perform
- waitForElement - Wait for element to match condition
match: Element matching criteria
- on: "xpath" or "css" - Selector type
- rule: Element state to wait for
  - "visible" - Element is visible on page
  - "attached" - Element exists in DOM
  - "hidden" - Element exists but is hidden
  - "detached" - Element is removed from DOM
- value: Selector string (XPath or CSS selector)
maxWait (optional): Maximum seconds to wait (must be > 0)
onFailure (optional): Action when operation fails
- "continue" - Ignore failure and continue
- "return" - Stop operations and return page
- "throw" - Raise an error
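
Since each operation is a nested dict, a typo in a key or value is easy to miss. The parameters listed above can be captured in a small builder that rejects values the guide does not list. This is a sketch under that assumption: `wait_for_element` is a hypothetical helper, and the API may accept values not documented here.

```python
VALID_SELECTORS = {"xpath", "css"}
VALID_RULES = {"visible", "attached", "hidden", "detached"}
VALID_ON_FAILURE = {"continue", "return", "throw"}


def wait_for_element(on, value, rule="visible", max_wait=None, on_failure=None):
    """Build a waitForElement operation dict from the documented parameters."""
    if on not in VALID_SELECTORS:
        raise ValueError(f"on must be one of {sorted(VALID_SELECTORS)}")
    if rule not in VALID_RULES:
        raise ValueError(f"rule must be one of {sorted(VALID_RULES)}")
    operation = {
        "task": "waitForElement",
        "match": {"on": on, "rule": rule, "value": value},
    }
    if max_wait is not None:
        if max_wait <= 0:
            raise ValueError("max_wait must be > 0")
        operation["maxWait"] = max_wait
    if on_failure is not None:
        if on_failure not in VALID_ON_FAILURE:
            raise ValueError(f"on_failure must be one of {sorted(VALID_ON_FAILURE)}")
        operation["onFailure"] = on_failure
    return operation
```

For example, `wait_for_element("xpath", "//*[@id='content-loaded']", max_wait=10, on_failure="return")` produces the same operation as the hand-written dict in the earlier example.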

Multiple Operations Example

You can chain multiple operations:

meta = {
    "askpablos_api_map": {
        "browser": True,
        "operations": [
            {
                "task": "waitForElement",
                "match": {
                    "on": "css",
                    "rule": "visible",
                    "value": "#login-form"
                },
                "maxWait": 5,
                "onFailure": "throw"
            },
            {
                "task": "waitForElement",
                "match": {
                    "on": "xpath",
                    "rule": "visible",
                    "value": "//div[@class='content-loaded']"
                },
                "maxWait": 15,
                "onFailure": "return"
            }
        ]
    }
}
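
If chained operations run one after another, their waits can add up, so it may help to compare the worst-case total against the TIMEOUT setting. The sketch below assumes sequential execution, which this guide does not state explicitly. `total_max_wait` is a hypothetical helper.

```python
def total_max_wait(operations):
    """Sum the maxWait values (in seconds) of all operations that define one."""
    return sum(op.get("maxWait", 0) for op in operations)


operations = [
    {"task": "waitForElement", "maxWait": 5},   # match omitted for brevity
    {"task": "waitForElement", "maxWait": 15},
]
print(total_max_wait(operations))  # 20
```

With the two operations from the example above, the worst case is 20 seconds of waiting, which leaves some headroom under a TIMEOUT of 30.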

Configuration Options

Meta Configuration Options

Option	Type	Required	Description
`browser`	bool	No	Use headless browser rendering
`screenshot`	bool	No	Take screenshot of the page (requires `browser: True`)
`operations`	list	No	Browser operations for SPA interaction (requires `browser: True`)
`geoLocation`	str	No	2-letter ISO country code for geo-targeting (e.g. `"US"`, `"PK"`)
`proxyType`	str	No	Proxy type: `"datacenter"`, `"residential"`, or `"mobile"`

Important Note: The options screenshot and operations only work when browser: True is set. If browser rendering is disabled, these options will be ignored.
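
Because these options are ignored silently, a missing browser flag can be hard to notice. A small check before building the request can surface it. This is a sketch: `ignored_options` is a hypothetical helper, not part of the package.

```python
BROWSER_ONLY_OPTIONS = ("screenshot", "operations")


def ignored_options(api_map):
    """List browser-only options that are set while browser rendering is off."""
    if api_map.get("browser"):
        return []
    return [name for name in BROWSER_ONLY_OPTIONS if api_map.get(name)]
```

A spider could log a warning whenever the returned list is not empty.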

Settings.py Configuration

Setting	Type	Default	Description
`API_KEY`	str	Required	Your AskPablos API key
`SECRET_KEY`	str	Required	Your AskPablos secret key
`APCLOUDY_URL`	str	https://appcloudy.askpablos.com	Base URL for AskPablos API
`TIMEOUT`	int	30	Request timeout in seconds
`MAX_RETRIES`	int	2	Maximum retry attempts

Best Practices

Configure timeouts appropriately:
- Set reasonable TIMEOUT values in settings.py
- Consider page complexity when setting timeouts
Use screenshots for debugging:
- Enable screenshots when troubleshooting
- Disable in production unless necessary
Optimize retry settings:
- Configure MAX_RETRIES globally in settings.py

Troubleshooting

Common Issues

Authentication Errors: Verify your API_KEY and SECRET_KEY
Timeout Issues: Increase TIMEOUT in settings.py
Rate Limiting: Reduce concurrent requests in your spider
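
For rate limiting, concurrency can be lowered with Scrapy's standard throttling settings. The values below are illustrative only, not recommendations from AskPablos. How the per-domain limit interacts with the middleware depends on how it dispatches requests, which this guide does not describe.

```python
# settings.py: illustrative values; tune them for your own crawl.
CONCURRENT_REQUESTS = 4              # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # Scrapy's default is 8
DOWNLOAD_DELAY = 1.0                 # minimum seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed latency
```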