AskPablos Scrapy API - Usage Guide

This guide walks you through how to configure and use AskPablosScrapyAPI in your Scrapy project.

Table of Contents


Installation

AskPablos Scrapy API requires Python 3.9 or higher and Scrapy 2.6.0 or higher.

Install the package using pip:

pip install askpablos-scrapy-api

Or directly from the repository:

pip install git+https://github.com/fawadss1/askpablos-scrapy-api.git

Configuration

Global Settings (settings.py)

Configure the middleware globally in your project’s settings.py file:

# Required settings
API_KEY = "your_api_key"          # Your AskPablos API key
SECRET_KEY = "your_secret_key"    # Your AskPablos secret key

# Optional global settings
APCLOUDY_URL = "https://domain.com"  # Base URL for AskPablos API (optional)
TIMEOUT = 30          # Request timeout in seconds
MAX_RETRIES = 2       # Maximum number of retries for failed requests

# Add the middleware
DOWNLOADER_MIDDLEWARES = {
    'askpablos_scrapy_api.middleware.AskPablosAPIDownloaderMiddleware': 585,
}

Per-Request Configuration

Configure individual requests using the askpablos_api_map in request meta:

meta = {
    "askpablos_api_map": {
        "browser": True,              # Use headless browser
        "screenshot": True,           # Take screenshot (requires browser: True)
        "operations": [...],          # Browser operations for SPA interaction (requires browser: True)
        "geoLocation": "US",          # Target country (2-letter ISO code, e.g. "PK", "US", "GB")
        "proxyType": "residential"    # Proxy type: "datacenter", "residential", or "mobile"
    }
}

Basic Usage

Simple GET Request with Browser Rendering

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

    def parse(self, response):
        # Process the response normally
        for item in response.css('.item'):
            yield {
                'title': item.css('h2::text').get(),
                'description': item.css('p::text').get()
            }

POST Request Support

import scrapy
import json

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Using FormRequest for POST requests
        yield scrapy.FormRequest(
            url='https://api.example.com/endpoint',
            formdata={'key': 'value'},
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

        # Or using Request with method='POST' and JSON body
        yield scrapy.Request(
            url='https://api.example.com/endpoint',
            method='POST',
            body=json.dumps({'key': 'value'}),
            headers={'Content-Type': 'application/json'},
            meta={
                "askpablos_api_map": {}
            },
            callback=self.parse
        )

    def parse(self, response):
        # Process the response
        data = response.json()
        yield {'result': data}

Advanced Usage

Geo-Targeted Requests

Route requests through a proxy in a specific country using the geoLocation option:

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "geoLocation": "US"   # 2-letter ISO country code
            }
        },
        callback=self.parse
    )

Supported values are standard ISO 3166-1 alpha-2 country codes such as "US", "PK", "GB", "DE", "FR", etc. The value is case-insensitive and will be normalized to uppercase internally.

Choosing Proxy Type

Control the type of proxy used with the proxyType option:

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "proxyType": "residential"   # "datacenter", "residential", or "mobile"
            }
        },
        callback=self.parse
    )

Value

Description

"datacenter"

Fast, cost-efficient proxies hosted in data centers

"residential"

IPs assigned by ISPs to real home users — high trust, harder to detect

"mobile"

IPs from mobile carriers — highest trust for mobile-targeted sites

Both options can be combined freely with browser, screenshot, operations, etc.:

meta = {
    "askpablos_api_map": {
        "browser": True,
        "screenshot": True,
        "geoLocation": "GB",
        "proxyType": "mobile"
    }
}

Screenshot Capture

def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "browser": True,
                "screenshot": True
            }
        },
        callback=self.parse_with_screenshot
    )

def parse_with_screenshot(self, response):
    # Access screenshot data
    screenshot = response.meta.get('screenshot')
    if screenshot:
        with open('page_screenshot.png', 'wb') as f:
            f.write(screenshot)

SPA (Single Page Application) Handling

Basic SPA Support

meta = {
    "askpablos_api_map": {
        "browser": True
    }
}

Advanced SPA Interaction with Operations

For more complex SPAs that require waiting for specific elements or performing actions:

def start_requests(self):
    yield scrapy.Request(
        url='https://spa-example.com',
        meta={
            "askpablos_api_map": {
                "browser": True,
                "operations": [
                    {
                        "task": "waitForElement",
                        "match": {
                            "on": "xpath",
                            "rule": "visible",
                            "value": "//*[@id='content-loaded']"
                        },
                        "maxWait": 10,
                        "onFailure": "return"
                    }
                ]
            }
        },
        callback=self.parse_spa
    )

def parse_spa(self, response):
    # The page has waited for the element to be visible
    data = response.css('.dynamic-content::text').getall()
    yield {'data': data}

Operations Parameters:

  • task: Action to perform

    • waitForElement - Wait for element to match condition

  • match: Element matching criteria

    • on: "xpath" or "css" - Selector type

    • rule: Element state to wait for

      • "visible" - Element is visible on page

      • "attached" - Element exists in DOM

      • "hidden" - Element exists but is hidden

      • "detached" - Element is removed from DOM

    • value: Selector string (XPath or CSS selector)

  • maxWait (optional): Maximum seconds to wait (must be > 0)

  • onFailure (optional): Action when operation fails

    • "continue" - Ignore failure and continue

    • "return" - Stop operations and return page

    • "throw" - Raise an error

Multiple Operations Example

You can chain multiple operations:

meta = {
    "askpablos_api_map": {
        "browser": True,
        "operations": [
            {
                "task": "waitForElement",
                "match": {
                    "on": "css",
                    "rule": "visible",
                    "value": "#login-form"
                },
                "maxWait": 5,
                "onFailure": "throw"
            },
            {
                "task": "waitForElement",
                "match": {
                    "on": "xpath",
                    "rule": "visible",
                    "value": "//div[@class='content-loaded']"
                },
                "maxWait": 15,
                "onFailure": "return"
            }
        ]
    }
}

Configuration Options

Meta Configuration Options

Option

Type

Required

Description

browser

bool

No

Use headless browser rendering

screenshot

bool

No

Take screenshot of the page (requires browser: True)

operations

list

No

Browser operations for SPA interaction (requires browser: True)

geoLocation

str

No

2-letter ISO country code for geo-targeting (e.g. "US", "PK")

proxyType

str

No

Proxy type: "datacenter", "residential", or "mobile"

Important Note: The options screenshot and operations only work when browser: True is set. If browser rendering is disabled, these options will be ignored.

Settings.py Configuration

Setting

Type

Default

Description

API_KEY

str

Required

Your AskPablos API key

SECRET_KEY

str

Required

Your AskPablos secret key

APCLOUDY_URL

str

https://appcloudy.askpablos.com

Base URL for AskPablos API

TIMEOUT

int

30

Request timeout in seconds

MAX_RETRIES

int

2

Maximum retry attempts


Best Practices

  1. Configure timeouts appropriately:

    • Set reasonable TIMEOUT values in settings.py

    • Consider page complexity when setting timeouts

  2. Use screenshots for debugging:

    • Enable screenshots when troubleshooting

    • Disable in production unless necessary

  3. Optimize retry settings:

    • Configure MAX_RETRIES globally in settings.py


Troubleshooting

Common Issues

  1. Authentication Errors: Verify your API_KEY and SECRET_KEY

  2. Timeout Issues: Increase TIMEOUT in settings.py

  3. Rate Limiting: Reduce concurrent requests in your spider