# AskPablos Scrapy API A professional Scrapy integration for seamlessly routing requests through AskPablos Proxy API with support for headless browser rendering and JavaScript strategies. ## Key Features - 🔄 **Selective Proxying**: Only routes requests with `askpablos_api_map` in their meta - 🌐 **Headless Browser Support**: Render JavaScript-heavy pages - 🔄 **Rotating Proxies**: Access to a pool of rotating IP addresses - 🧠 **JavaScript Rendering**: Render JavaScript-heavy pages - 📸 **Screenshot Capture**: Take screenshots - 🎯 **SPA Operations**: Advanced browser operations for interacting with Single Page Applications - 🔒 **Secure Authentication**: HMAC-SHA256 request signing - 🔁 **Automatic Retries**: Configurable retry logic - ⚠️ **Comprehensive Error Handling**: Detailed logging and error reporting ## Requirements - Python 3.9+ - Scrapy 2.6+ - Valid AskPablos Proxy API credentials ## Installation ```bash pip install askpablos-scrapy-api ``` Or install directly from GitHub: ```bash pip install git+https://github.com/fawadss1/askpablos-scrapy-api.git ``` ## Quick Start ### 1. Configure Settings Add to your `settings.py`: ```python # Required settings API_KEY = "your_api_key" # Your AskPablos API key SECRET_KEY = "your_secret_key" # Your AskPablos secret key # Optional global settings TIMEOUT = 30 # Request timeout in seconds MAX_RETRIES = 2 # Maximum number of retries # Add the middleware DOWNLOADER_MIDDLEWARES = { 'askpablos_scrapy_api.middleware.AskPablosAPIDownloaderMiddleware': 585, } ``` ### 2. Use in Your Spider ```python import scrapy class MySpider(scrapy.Spider): name = 'example' def start_requests(self): yield scrapy.Request( url='https://example.com', meta={ "askpablos_api_map": { "browser": True } } ) ``` ## Configuration Options ### Meta Configuration | Option | Type | Description | |-----------------|----------|--------------------------------------------------------| | `browser` | bool | Use headless browser rendering | | `screenshot` | bool | Take screenshot of the page (requires browser: True) | | `operations` | list | Browser operations for SPA interaction (requires browser: True) | **Important Note:** The options `screenshot` and `operations` only work when `browser: True` is set. ## Environment Variables Instead of putting sensitive API keys in your settings file, you can use environment variables: ```bash # Set these environment variables before running your spider export ASKPABLOS_API_KEY="your_api_key" export ASKPABLOS_SECRET_KEY="your_secret_key" ``` ## Documentation - [Detailed Usage Guide](usage.md) - Complete instructions for configuring and using AskPablos Scrapy API - [FAQ](faq.md) - Answers to common questions and troubleshooting help ## How It Works AskPablos Scrapy API intercepts requests with the `askpablos_api_map` in their meta dictionary and routes them through the AskPablos proxy service. The service can process requests using headless browsers and/or rotating proxies as specified, before returning the HTML response. Requests without the `askpablos_api_map` configuration bypass the processing entirely, giving you full control over which requests use the proxy service. ## Advanced Configuration ### All Available Options ```python # Request with all available options yield scrapy.Request( url="https://example.com", callback=self.parse, meta={ 'askpablos_api_map': { 'browser': True, # Use headless browser 'screenshot': True, # Take screenshot } } ) ``` ### SPA Interaction with Operations For Single Page Applications that require waiting for specific elements: ```python import scrapy class MySPASpider(scrapy.Spider): name = 'spa_example' def start_requests(self): yield scrapy.Request( url='https://spa-example.com', meta={ "askpablos_api_map": { "browser": True, "operations": [ { "task": "waitForElement", "match": { "on": "xpath", # or "css" "rule": "visible", # or "attached", "hidden", "detached" "value": "//*[@id='content-loaded']" }, "maxWait": 10, "onFailure": "return" # or "continue", "throw" } ] } }, callback=self.parse ) def parse(self, response): # Element has loaded before parsing data = response.css('.dynamic-content::text').getall() yield {'data': data} ``` See the [Detailed Usage Guide](usage.md) for more examples and complete documentation. ## Contributing Contributions are welcome! Please feel free to submit a Pull Request. ## License This project is licensed under the MIT License - see the LICENSE file for details. ## Author Fawad Ali ([@fawadss1](https://github.com/fawadss1))