# AskPablos Scrapy API - Usage Guide

This guide walks you through how to configure and use AskPablosScrapyAPI in your Scrapy project.

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [Basic Usage](#basic-usage)
- [Advanced Usage](#advanced-usage)
- [Configuration Options](#configuration-options)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)

---

## Installation

AskPablos Scrapy API requires Python 3.9 or higher and Scrapy 2.6.0 or higher.

Install the package using pip:

```bash
pip install askpablos-scrapy-api
```

Or directly from the repository:

```bash
pip install git+https://github.com/fawadss1/askpablos-scrapy-api.git
```

---

## Configuration

### Global Settings (settings.py)

Configure the middleware globally in your project's `settings.py` file:

```python
# Required settings
API_KEY = "your_api_key"        # Your AskPablos API key
SECRET_KEY = "your_secret_key"  # Your AskPablos secret key

# Optional global settings
APCLOUDY_URL = "https://domain.com"  # Base URL for AskPablos API (optional)
TIMEOUT = 30      # Request timeout in seconds
MAX_RETRIES = 2   # Maximum number of retries for failed requests

# Add the middleware
DOWNLOADER_MIDDLEWARES = {
    'askpablos_scrapy_api.middleware.AskPablosAPIDownloaderMiddleware': 585,
}
```

### Per-Request Configuration

Configure individual requests using the `askpablos_api_map` in request meta:

```python
meta = {
    "askpablos_api_map": {
        "browser": True,            # Use headless browser
        "screenshot": True,         # Take screenshot (requires browser: True)
        "operations": [...],        # Browser operations for SPA interaction (requires browser: True)
        "geoLocation": "US",        # Target country (2-letter ISO code, e.g. "PK", "US", "GB")
        "proxyType": "residential"  # Proxy type: "datacenter", "residential", or "mobile"
    }
}
```

---

## Basic Usage

### Simple GET Request with Browser Rendering

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

    def parse(self, response):
        # Process the response normally
        for item in response.css('.item'):
            yield {
                'title': item.css('h2::text').get(),
                'description': item.css('p::text').get()
            }
```

### POST Request Support

```python
import scrapy
import json


class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Using FormRequest for POST requests
        yield scrapy.FormRequest(
            url='https://api.example.com/endpoint',
            formdata={'key': 'value'},
            meta={
                "askpablos_api_map": {
                    "browser": True
                }
            },
            callback=self.parse
        )

        # Or using Request with method='POST' and JSON body
        yield scrapy.Request(
            url='https://api.example.com/endpoint',
            method='POST',
            body=json.dumps({'key': 'value'}),
            headers={'Content-Type': 'application/json'},
            meta={
                "askpablos_api_map": {}
            },
            callback=self.parse
        )

    def parse(self, response):
        # Process the response
        data = response.json()
        yield {'result': data}
```

---

## Advanced Usage

### Geo-Targeted Requests

Route requests through a proxy in a specific country using the `geoLocation` option:

```python
def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "geoLocation": "US"  # 2-letter ISO country code
            }
        },
        callback=self.parse
    )
```

Supported values are standard ISO 3166-1 alpha-2 country codes such as `"US"`, `"PK"`, `"GB"`, `"DE"`, `"FR"`, etc. The value is case-insensitive and will be normalized to uppercase internally.

### Choosing Proxy Type

Control the type of proxy used with the `proxyType` option:

```python
def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "proxyType": "residential"  # "datacenter", "residential", or "mobile"
            }
        },
        callback=self.parse
    )
```

| Value | Description |
|---|---|
| `"datacenter"` | Fast, cost-efficient proxies hosted in data centers |
| `"residential"` | IPs assigned by ISPs to real home users; high trust, harder to detect |
| `"mobile"` | IPs from mobile carriers; highest trust for mobile-targeted sites |

Both options can be combined freely with `browser`, `screenshot`, `operations`, etc.:

```python
meta = {
    "askpablos_api_map": {
        "browser": True,
        "screenshot": True,
        "geoLocation": "GB",
        "proxyType": "mobile"
    }
}
```

### Screenshot Capture

```python
def start_requests(self):
    yield scrapy.Request(
        url='https://example.com',
        meta={
            "askpablos_api_map": {
                "browser": True,
                "screenshot": True
            }
        },
        callback=self.parse_with_screenshot
    )

def parse_with_screenshot(self, response):
    # Access screenshot data
    screenshot = response.meta.get('screenshot')
    if screenshot:
        with open('page_screenshot.png', 'wb') as f:
            f.write(screenshot)
```

### SPA (Single Page Application) Handling

#### Basic SPA Support

```python
meta = {
    "askpablos_api_map": {
        "browser": True
    }
}
```

#### Advanced SPA Interaction with Operations

For more complex SPAs that require waiting for specific elements or performing actions:

```python
def start_requests(self):
    yield scrapy.Request(
        url='https://spa-example.com',
        meta={
            "askpablos_api_map": {
                "browser": True,
                "operations": [
                    {
                        "task": "waitForElement",
                        "match": {
                            "on": "xpath",
                            "rule": "visible",
                            "value": "//*[@id='content-loaded']"
                        },
                        "maxWait": 10,
                        "onFailure": "return"
                    }
                ]
            }
        },
        callback=self.parse_spa
    )

def parse_spa(self, response):
    # The page has waited for the element to be visible
    data = response.css('.dynamic-content::text').getall()
    yield {'data': data}
```

**Operations Parameters:**

- **task**: Action to perform
  - `waitForElement` - Wait for element to match condition
- **match**: Element matching criteria
  - `on`: `"xpath"` or `"css"` - Selector type
  - `rule`: Element state to wait for
    - `"visible"` - Element is visible on page
    - `"attached"` - Element exists in DOM
    - `"hidden"` - Element exists but is hidden
    - `"detached"` - Element is removed from DOM
  - `value`: Selector string (XPath or CSS selector)
- **maxWait** (optional): Maximum seconds to wait (must be > 0)
- **onFailure** (optional): Action when operation fails
  - `"continue"` - Ignore failure and continue
  - `"return"` - Stop operations and return page
  - `"throw"` - Raise an error

#### Multiple Operations Example

You can chain multiple operations:

```python
meta = {
    "askpablos_api_map": {
        "browser": True,
        "operations": [
            {
                "task": "waitForElement",
                "match": {
                    "on": "css",
                    "rule": "visible",
                    "value": "#login-form"
                },
                "maxWait": 5,
                "onFailure": "throw"
            },
            {
                "task": "waitForElement",
                "match": {
                    "on": "xpath",
                    "rule": "visible",
                    "value": "//div[@class='content-loaded']"
                },
                "maxWait": 15,
                "onFailure": "return"
            }
        ]
    }
}
```

---

## Configuration Options

### Meta Configuration Options

| Option          | Type     | Required | Description                                                         |
|-----------------|----------|----------|---------------------------------------------------------------------|
| `browser`       | bool     | No       | Use headless browser rendering                                      |
| `screenshot`    | bool     | No       | Take screenshot of the page (requires `browser: True`)              |
| `operations`    | list     | No       | Browser operations for SPA interaction (requires `browser: True`)   |
| `geoLocation`   | str      | No       | 2-letter ISO country code for geo-targeting (e.g. `"US"`, `"PK"`)   |
| `proxyType`     | str      | No       | Proxy type: `"datacenter"`, `"residential"`, or `"mobile"`          |

**Important Note:** The options `screenshot` and `operations` only work when `browser: True` is set. If browser rendering is disabled, these options will be ignored.
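
Because browser-only options are silently ignored without `browser: True`, it can help to check the map on the client side before building a request. The following is a minimal sketch of such a check, not part of the package: `build_api_map` and `check_operation` are hypothetical helper names, and the allowed values are simply copied from the tables and parameter list in this guide. It does not restrict `task`, since the API may accept tasks beyond `waitForElement`.

```python
import re
import warnings

# Allowed values copied from this guide; the real API may accept more.
PROXY_TYPES = {"datacenter", "residential", "mobile"}
MATCH_RULES = {"visible", "attached", "hidden", "detached"}
ON_FAILURE = {"continue", "return", "throw"}
BROWSER_ONLY = ("screenshot", "operations")


def check_operation(op):
    """Hypothetical helper: validate one operation dict against the documented parameters."""
    match = op.get("match", {})
    if "task" not in op:
        raise ValueError("operation is missing 'task'")
    if match.get("on") not in {"xpath", "css"}:
        raise ValueError("match.on must be 'xpath' or 'css'")
    if match.get("rule") not in MATCH_RULES:
        raise ValueError(f"match.rule must be one of {sorted(MATCH_RULES)}")
    if not match.get("value"):
        raise ValueError("match.value must be a non-empty selector")
    if "maxWait" in op and not op["maxWait"] > 0:
        raise ValueError("maxWait must be > 0")
    if "onFailure" in op and op["onFailure"] not in ON_FAILURE:
        raise ValueError(f"onFailure must be one of {sorted(ON_FAILURE)}")


def build_api_map(**options):
    """Hypothetical helper: return a cleaned dict for meta['askpablos_api_map']."""
    api_map = dict(options)

    # Client-side mirror of the documented uppercase normalization.
    if "geoLocation" in api_map:
        code = str(api_map["geoLocation"]).strip().upper()
        if not re.fullmatch(r"[A-Z]{2}", code):
            raise ValueError(f"geoLocation must be a 2-letter code, got {code!r}")
        api_map["geoLocation"] = code

    if "proxyType" in api_map and api_map["proxyType"] not in PROXY_TYPES:
        raise ValueError(f"proxyType must be one of {sorted(PROXY_TYPES)}")

    if api_map.get("browser"):
        for op in api_map.get("operations", []):
            check_operation(op)
    else:
        # Make the "ignored without browser" behavior visible instead of silent.
        for key in BROWSER_ONLY:
            if key in api_map:
                warnings.warn(f"{key!r} requires browser=True; dropping it")
                del api_map[key]
    return api_map
```

In a spider this would be used as `meta={"askpablos_api_map": build_api_map(browser=True, geoLocation="gb")}`. Dropping the browser-only keys with a warning is a design choice of this sketch; raising an error instead would be equally reasonable.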

### Settings.py Configuration

| Setting         | Type | Default                               | Description                               |
|-----------------|------|---------------------------------------|-------------------------------------------|
| `API_KEY`       | str  | Required                              | Your AskPablos API key                    |
| `SECRET_KEY`    | str  | Required                              | Your AskPablos secret key                 |
| `APCLOUDY_URL`  | str  | https://appcloudy.askpablos.com       | Base URL for AskPablos API                |
| `TIMEOUT`       | int  | 30                                    | Request timeout in seconds                |
| `MAX_RETRIES`   | int  | 2                                     | Maximum retry attempts                    |

---

## Best Practices

1. **Configure timeouts appropriately**:
   - Set reasonable `TIMEOUT` values in settings.py
   - Consider page complexity when setting timeouts

2. **Use screenshots for debugging**:
   - Enable screenshots when troubleshooting
   - Disable in production unless necessary

3. **Optimize retry settings**:
   - Configure `MAX_RETRIES` globally in settings.py

---

## Troubleshooting

### Common Issues

1. **Authentication Errors**: Verify your `API_KEY` and `SECRET_KEY`
2. **Timeout Issues**: Increase `TIMEOUT` in settings.py
3. **Rate Limiting**: Reduce concurrent requests in your spider
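
For the timeout and rate-limiting items, one possible starting point is a `settings.py` fragment like the one below. `CONCURRENT_REQUESTS`, `CONCURRENT_REQUESTS_PER_DOMAIN`, `DOWNLOAD_DELAY`, and `AUTOTHROTTLE_ENABLED` are standard Scrapy settings; `TIMEOUT` and `MAX_RETRIES` are the AskPablos settings described above. The specific numbers are illustrative assumptions, not recommended values, and how the middleware interacts with Scrapy's per-domain throttling is not covered by this guide.

```python
# settings.py (fragment) - illustrative values only

# Standard Scrapy settings: fewer parallel requests and a pause between them.
CONCURRENT_REQUESTS = 4             # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # Scrapy's default is 8
DOWNLOAD_DELAY = 1.0                # minimum seconds between requests to the same site

# Standard Scrapy extension: adjust the delay based on observed latency.
AUTOTHROTTLE_ENABLED = True

# AskPablos settings: allow more time for slow, browser-rendered pages.
TIMEOUT = 60
MAX_RETRIES = 2
```

If timeouts persist after raising `TIMEOUT`, lowering concurrency first and then adjusting one setting at a time makes it easier to see which change helped.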