
apify / crawlee-python


Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Home Page: https://crawlee.dev/python/

License: Apache License 2.0

Languages: Makefile 0.21%, Python 86.34%, JavaScript 10.53%, CSS 2.71%, Shell 0.20%

Topics: apify, automation, beautifulsoup, crawler, crawling, headless, headless-chrome, pip, playwright, python

crawlee-python's People

Contributors

asymness, b4nan, barjin, eltociear, fauzaanu, janbuchar, kpcofgs, mantisus, renovate[bot], siddiqkaithodu, souravjain540, tymeek, vdusek

crawlee-python's Issues

Configure Renovate bot

Configure Renovate to keep the Python dependencies up to date (the Poetry lock file, dev dependencies in pyproject.toml, ...), the same way we use it in our JS/TS projects.

I expect the Renovate bot to update dependencies at regular intervals. If the updates pass the tests, it will commit the changes directly to master; otherwise, it will open a pull request.

Once this is done, please open the same issue for SDK, Client, and Shared Python repositories with a link to this one.

Blocked by #6.

Kick off the AutoscaledPool

Tasks

  • Explore the Apify/Crawlee AutoscaledPool in JavaScript (https://crawlee.dev/api/core/class/AutoscaledPool).
  • Figure out how to implement similar functionality in Python.
  • Prepare a PoC of the AutoscaledPool for the Python SDK.
  • Measure the performance of the PoCs.

Test Actor for PoCs:

import asyncio
from dataclasses import dataclass, field
from time import time
from typing import Callable
from urllib.parse import urljoin

from apify import Actor
from apify.storages import RequestQueue
from bs4 import BeautifulSoup, Tag
from httpx import AsyncClient


@dataclass(frozen=True)
class ActorInput:
    start_urls: list[dict] = field(default_factory=lambda: [{'url': 'https://apify.com'}])
    max_depth: int = 1
    desired_concurrency: int = 10


class BeautifulSoupCrawler:
    def __init__(self, handle_request: Callable, max_depth: int, desired_concurrency: int) -> None:
        self.handle_request = handle_request
        self.max_depth = max_depth
        self.desired_concurrency = desired_concurrency

    async def run(self, start_urls: list) -> None:
        # TODO: every PoC will implement this differently.
        raise NotImplementedError


async def handle_request(request: dict, request_queue: RequestQueue, max_depth: int) -> None:
    url = request['url']
    depth = request['userData']['depth']
    Actor.log.info(f'Scraping {url} (depth={depth}) ...')

    try:
        async with AsyncClient() as client:
            response = await client.get(url, follow_redirects=True)
        soup = BeautifulSoup(response.content, 'html.parser')

        # If we haven't reached the max depth, look for nested links and enqueue their targets
        if depth < max_depth:
            for link in soup.find_all('a'):
                link_href = link.get('href')
                link_url = urljoin(url, link_href)
                if link_url.startswith(('http://', 'https://')):
                    Actor.log.info(f'Enqueuing {link_url} ...')
                    await request_queue.add_request(
                        {
                            'url': link_url,
                            'userData': {'depth': depth + 1},
                        }
                    )

        result = {
            'url': url,
            'title': soup.title.string if isinstance(soup.title, Tag) else None,
        }
        await Actor.push_data(result)

    except Exception:
        Actor.log.exception(f'Cannot extract data from {url}.')
    finally:
        # Mark the request as handled so it's not processed again
        await request_queue.mark_request_as_handled(request)


async def main() -> None:
    async with Actor:
        actor_input = ActorInput(**(await Actor.get_input() or {}))
        crawler = BeautifulSoupCrawler(handle_request, actor_input.max_depth, actor_input.desired_concurrency)
        start = time()
        await crawler.run(actor_input.start_urls)
        elapsed_time = time() - start
        Actor.log.info(f'Time taken: {elapsed_time}')
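
For orientation, below is a minimal sketch of a fixed-concurrency pool built on asyncio primitives, loosely mirroring the runTaskFunction / isFinishedFunction options of the JS AutoscaledPool. It only illustrates the core run/worker loop; the real AutoscaledPool additionally adjusts the concurrency based on CPU, memory, and event-loop snapshots.

import asyncio
from collections.abc import Awaitable, Callable


class FixedConcurrencyPool:
    """Illustrative only: run tasks with a fixed concurrency limit.

    The real AutoscaledPool also scales the limit up and down based on system load.
    """

    def __init__(
        self,
        run_task: Callable[[], Awaitable[None]],
        is_finished: Callable[[], Awaitable[bool]],
        desired_concurrency: int,
    ) -> None:
        self._run_task = run_task
        self._is_finished = is_finished
        self._semaphore = asyncio.Semaphore(desired_concurrency)

    async def _worker(self) -> None:
        try:
            await self._run_task()
        finally:
            self._semaphore.release()

    async def run(self) -> None:
        tasks: set[asyncio.Task] = set()
        while not await self._is_finished():
            await self._semaphore.acquire()  # wait for a free concurrency slot
            task = asyncio.create_task(self._worker())
            tasks.add(task)
            task.add_done_callback(tasks.discard)
        await asyncio.gather(*tasks)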

Add package for run-time type checking

Description

In the PR apify/apify-sdk-python#171, @janbuchar suggested using some run-time type checking for Python.

For example typeguard: it can be applied either via the @typechecked decorator for a specific function, or via the import hook typeguard.install_import_hook() for a whole module.

For methods/functions where we currently check the types of arguments and return values manually, it could make sense to use it, e.g. here: https://github.com/apify/apify-sdk-python/blob/v1.5.1/src/apify/scrapy/utils.py#L44.
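
For illustration, a minimal decorator-based example (a sketch; the exact exception type depends on the typeguard version):

from typeguard import typechecked


@typechecked
def to_positive_int(value: int) -> int:
    # Both the argument and the return value are validated against the
    # annotations at run time.
    return abs(value)


to_positive_int(5)     # OK
to_positive_int('5')   # raises a run-time type check error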

Potential problems

I suppose it is implemented using typing.get_type_hints to obtain the type hints of a specific function. I ran into a bug when typing.get_type_hints and from __future__ import annotations are used together; see apify/apify-sdk-python#151. However, tests should reveal any such problem.

Passing context to crawler handlers

The problem

We want to pick the best approach for passing context data/helpers to the various handler functions in crawlee-python. We already have an implementation in place, but if there's a better way, we should switch to it sooner rather than later.

What OG Crawlee does

new CheerioCrawler({
  requestHandler: async ({ request, pushData, enqueueLinks }) => { // types of helpers are inferred correctly - thanks typescript
    // ...
  }
})
  • types are correct
  • we get suggestions for context helpers
  • implementation is iffy from a type-safety perspective, but salvageable

Python version A (+/- current implementation)

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:  # explicit type annotation is necessary for type checking and suggestions
  context.push_data(...)
  • if a type checker and annotations are used, types are correct (can't get better than that in Python)
  • we get suggestions for context helpers
  • the implementation is type-safe enough, but very inflexible
    • in contrast to TypeScript, it won't be salvageable anytime soon - not until we have intersection types
  • there is no "object destructuring" in Python, so everything needs to be prefixed with context.

Python version B

This proposal is similar to how pytest fixtures or FastAPI dependencies work.

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(push_data: PushData, soup: BeautifulSoup) -> None:  # explicit type annotation is necessary for type checking
  push_data(...)
  • no context. prefix
  • the function signature is not checked by a type checker, but we can do it when the handler is registered, which should be fine as well
  • allows for a more flexible implementation with easier code reuse
  • no suggestions of parameter names
  • the "injection" from Crawlee's side can be based on both parameter name and type annotation, so the type annotations are optional for users (but if they don't use it, they miss out on type safety and autocompletions)
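
For illustration, version B could be driven by signature inspection on Crawlee's side. The sketch below only shows the injection step; call_handler and the available mapping are hypothetical names, not an existing API:

import inspect
from typing import Any, Callable


async def call_handler(handler: Callable, available: dict[str, Any]) -> None:
    # Inspect the registered handler and pass in only the context helpers
    # it actually asks for, matched by parameter name.
    signature = inspect.signature(handler)
    kwargs = {}
    for name in signature.parameters:
        if name not in available:
            raise TypeError(f'Cannot inject unknown handler parameter {name!r}')
        kwargs[name] = available[name]
    await handler(**kwargs)

Matching by type annotation instead of (or in addition to) parameter name would follow the same pattern, reading signature.parameters[name].annotation in the same loop.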

Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.

Better approach of making a cache

  • This is a follow-up issue to the discussion in #82 (comment).
  • Currently, we have our own implementation of an LRU cache in crawlee/_utils/lru_cache.py.
  • Let's do it in a more Pythonic way, maybe utilizing the built-in caching from the functools standard module (the lru_cache decorator)? See the sketch below.
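
A minimal sketch of the two obvious directions; the names are placeholders, and neither is meant as the final implementation:

from collections import OrderedDict
from functools import lru_cache


# If the cached value is the result of a pure function, lru_cache is enough.
@lru_cache(maxsize=1024)
def normalized_hostname(url: str) -> str:
    return url.split('//', 1)[-1].split('/', 1)[0].lower()


# If a dict-like interface is still required, OrderedDict keeps the LRU logic short.
class LRUDict(OrderedDict):
    def __init__(self, max_length: int) -> None:
        super().__init__()
        self._max_length = max_length

    def __setitem__(self, key, value) -> None:
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self._max_length:
            self.popitem(last=False)

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # accessing an item marks it as most recently used
        return value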

Introduce a better solution for dealing with byte size

Current state

  • Currently, we have many variables describing a "byte size" as an integer. This leads to identifiers with _bytes suffixes, e.g. max_memory_bytes, buffer_memory_bytes, threshold_memory_bytes, ...
  • Then we have to use conversion functions, for example when we want to log such a value (e.g. a to_mb function).

Goal state

  • Use a cleverer way of dealing with these kinds of variables, similar to how we utilize datetime.timedelta for durations.
  • Either by implementing our own solution, e.g. like this:
# src/crawlee/_utils/byte_size.py

from __future__ import annotations

from dataclasses import dataclass

_BYTES_PER_KB = 1024
_BYTES_PER_MB = _BYTES_PER_KB**2
_BYTES_PER_GB = _BYTES_PER_KB**3
_BYTES_PER_TB = _BYTES_PER_KB**4

@dataclass
class ByteSize:
    """Represents a size in bytes."""

    bytes_: int

    def to_kb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_KB

    def to_mb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_MB

    def to_gb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_GB

    def to_tb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_TB

    def __str__(self: ByteSize) -> str:
        if self.bytes_ >= _BYTES_PER_TB:
            return f'{self.to_tb():.2f} TB'

        if self.bytes_ >= _BYTES_PER_GB:
            return f'{self.to_gb():.2f} GB'

        if self.bytes_ >= _BYTES_PER_MB:
            return f'{self.to_mb():.2f} MB'

        if self.bytes_ >= _BYTES_PER_KB:
            return f'{self.to_kb():.2f} KB'

        return f'{self.bytes_} Bytes'
  • Or use some existing solution. Explore the following:
    • typing.NewType;
    • packages on PyPI.
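
A quick usage check of the sketched class above (purely illustrative):

# Continuing the sketch above (same module):
max_memory = ByteSize(bytes_=3 * _BYTES_PER_GB + 512 * _BYTES_PER_MB)
print(max_memory.to_gb())  # 3.5
print(max_memory)          # 3.50 GB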

System status reports "Total weight cannot be zero"

  • Crawlee v0.0.2
  • Script:
import asyncio
import logging
from bs4 import BeautifulSoup
from crawlee.log_config import CrawleeLogFormatter
from crawlee.http_crawler import HttpCrawler
from crawlee.http_crawler.types import HttpCrawlingContext
from crawlee.storages import Dataset, RequestList

logger = logging.getLogger()
handler = logging.StreamHandler()
handler.setFormatter(fmt=CrawleeLogFormatter())
logger.addHandler(hdlr=handler)
logger.setLevel(logging.DEBUG)


async def main() -> None:
    request_list = RequestList(['https://crawlee.dev'])
    crawler = HttpCrawler(request_provider=request_list)
    dataset = await Dataset.open()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        status_code = context.http_response.status_code
        soup = BeautifulSoup(context.http_response.read(), 'lxml')
        title = soup.find('title').text
        result = {'url': context.http_response.url, 'title': title, 'status_code': status_code}
        print(f'Got the result: {result}, gonna push it to the dataset.')
        await dataset.push_data(result)

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
  • Resulting in
[asyncio] DEBUG Using selector: EpollSelector
[httpx] DEBUG load_ssl_context verify=True cert=None trust_env=True http2=False
[httpx] DEBUG load_verify_locations cafile='/home/vdusek/Projects/crawlee-py/.venv/lib/python3.12/site-packages/certifi/cacert.pem'
[crawlee.autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 3.84 GB.
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_snapshot_event_loop, delay=0:00:00.500000)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_snapshot_client, delay=0:00:01)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_log_system_status, delay=0:01:00)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_autoscale, delay=0:00:10)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_emit_system_info_event, delay=0:01:00)...
[crawlee.autoscaling.autoscaled_pool] DEBUG Starting the pool
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0; client_info = 0
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - already running at desired concurrency
[crawlee.autoscaling.autoscaled_pool] DEBUG Worker task finished
[httpcore.connection] DEBUG connect_tcp.started host='crawlee.dev' port=443 local_address=None timeout=5.0 socket_options=None
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Not scheduling new task - no task is ready
[httpcore.connection] DEBUG connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f68ede856d0>
[httpcore.connection] DEBUG start_tls.started ssl_context=<ssl.SSLContext object at 0x7f68ede482d0> server_hostname='crawlee.dev' timeout=5.0
[httpcore.connection] DEBUG start_tls.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f68edd41550>
[httpcore.http11] DEBUG send_request_headers.started request=<Request [b'GET']>
[httpcore.http11] DEBUG send_request_headers.complete
[httpcore.http11] DEBUG send_request_body.started request=<Request [b'GET']>
[httpcore.http11] DEBUG send_request_body.complete
[httpcore.http11] DEBUG receive_response_headers.started request=<Request [b'GET']>
[httpcore.http11] DEBUG receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Connection', b'keep-alive'), (b'Content-Length', b'15939'), (b'Server', b'GitHub.com'), (b'Content-Type', b'text/html; charset=utf-8'), (b'Last-Modified', b'Thu, 11 Apr 2024 09:11:37 GMT'), (b'Access-Control-Allow-Origin', b'*'), (b'Strict-Transport-Security', b'max-age=31556952'), (b'ETag', b'W/"6617a949-11d3c"'), (b'expires', b'Thu, 11 Apr 2024 09:29:14 GMT'), (b'Cache-Control', b'max-age=600'), (b'Content-Encoding', b'gzip'), (b'x-proxy-cache', b'MISS'), (b'X-GitHub-Request-Id', b'79F2:30F74F:7BBC900:7DB5193:6617AB12'), (b'Accept-Ranges', b'bytes'), (b'Date', b'Thu, 11 Apr 2024 12:52:21 GMT'), (b'Via', b'1.1 varnish'), (b'Age', b'149'), (b'X-Served-By', b'cache-fra-etou8220138-FRA'), (b'X-Cache', b'HIT'), (b'X-Cache-Hits', b'1'), (b'X-Timer', b'S1712839941.239169,VS0,VE1'), (b'Vary', b'Accept-Encoding'), (b'X-Fastly-Request-ID', b'cef204eb5ba20a84be8334407996f7874dd39c5a')])
[httpx] INFO  HTTP Request: GET https://crawlee.dev "HTTP/1.1 200 OK"
[httpcore.http11] DEBUG receive_response_body.started request=<Request [b'GET']>
[httpcore.http11] DEBUG receive_response_body.complete
[httpcore.http11] DEBUG response_closed.started
[httpcore.http11] DEBUG response_closed.complete
Got the result: {'url': URL('https://crawlee.dev'), 'title': 'Crawlee · Build reliable crawlers. Fast. | Crawlee', 'status_code': 200}, gonna push it to the dataset.
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee.autoscaling.autoscaled_pool] DEBUG Worker task finished
[crawlee.autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee.autoscaling.autoscaled_pool] DEBUG Terminating - no running tasks to wait for
[crawlee.autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
  • There is probably some issue in the Snapshotter / SystemStatus - investigate & fix it.
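
For context, the warning text reads like a guard in a weighted-average computation that receives an empty (or all-zero-weight) sample set, which would be consistent with the system status being evaluated before any snapshots were collected. A purely hypothetical illustration of such a guard (not the actual crawlee code):

def weighted_avg(values: list[float], weights: list[float]) -> float:
    # Hypothetical illustration: with no samples, the total weight is zero
    # and there is nothing meaningful to average.
    total = sum(weights)
    if total == 0:
        raise ValueError('Total weight cannot be zero')
    return sum(v * w for v, w in zip(values, weights)) / total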

Use `uv` as the packaging tool in CI builds

Recently, the creators of Ruff (Astral) released a new package installer and resolver called uv, written in Rust. Perhaps we could integrate it into our CI pipelines, as installing everything for all supported Python versions, as well as on Linux and Windows, can take some time.

This week, a similar approach was implemented in Apache Airflow: apache/airflow#37692.

Improve unit testing of Snapshotter

We're touching a lot of private internals there; let's do it in a better way.

We discussed it in discussion_r1521267138.

Mainly, the following suggestion is a good one:

"Or we could make a testing implementation of EventManager where emitting events could be done from the outside (I mean from the test)."
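
A rough sketch of that idea; the class and method names below are hypothetical, not the existing EventManager API. The point is that the test itself emits synthetic system-info events, so no private attributes of the Snapshotter need to be touched:

from collections import defaultdict
from typing import Any, Callable


class TestingEventManager:
    """Hypothetical test double: events are emitted directly from the test."""

    def __init__(self) -> None:
        self._listeners: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def on(self, event: str, listener: Callable[[Any], None]) -> None:
        self._listeners[event].append(listener)

    def emit(self, event: str, payload: Any) -> None:
        # Called from the test to simulate e.g. a system-info event.
        for listener in self._listeners[event]:
            listener(payload)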

Add base storage client and resource subclients

Description

Currently, our resource clients are memory-storage specific. Let's update them to be storage-agnostic. That will probably require updating the BaseStorageClient and MemoryStorageClient as well.

However, fully storage-agnostic resource clients are not an option given the structure of the Apify (platform) clients. So instead, let's implement a unified interface (abstract base classes) for BaseStorage and all resource sub-clients (it will be based on the ApifyClient). All of the specific storage clients should inherit from the base classes and implement the relevant methods.

Soon we will have MemoryStorageClient, FileSystemStorageClient (probably extending the MemoryStorageClient), and ApifyStorageClient (implemented in the apify-sdk or in apify-client). All of them should implement the BaseStorageClient interface.

StorageClientManager will take care of setting the specific StorageClient.
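
A minimal sketch of what such a unified interface could look like; all class and method names below are illustrative assumptions modeled on the description above, not the final design:

from abc import ABC, abstractmethod


class BaseDatasetClient(ABC):
    """Interface of a single dataset resource sub-client (illustrative)."""

    @abstractmethod
    async def push_items(self, items: list[dict]) -> None: ...


class BaseStorageClient(ABC):
    """Interface every storage backend implements (illustrative)."""

    @abstractmethod
    def dataset(self, id: str) -> BaseDatasetClient: ...


class MemoryDatasetClient(BaseDatasetClient):
    def __init__(self) -> None:
        self.items: list[dict] = []

    async def push_items(self, items: list[dict]) -> None:
        self.items.extend(items)


class MemoryStorageClient(BaseStorageClient):
    def __init__(self) -> None:
        self._datasets: dict[str, MemoryDatasetClient] = {}

    def dataset(self, id: str) -> MemoryDatasetClient:
        if id not in self._datasets:
            self._datasets[id] = MemoryDatasetClient()
        return self._datasets[id]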

Implement auto-purging of storages

We need the same behavior as in the JS version:

  • crawlee implements the base storage classes
  • every async operation checks whether it is the first call and purges the storages automatically, unless opted out via the CRAWLEE_PURGE_ON_START env var (set to a falsy value like 0 or false)
  • we have this method, which is called in many places in the storage methods like open or getInput: https://crawlee.dev/api/core/function/purgeDefaultStorages
  • since the SDK uses those storage classes, it gets the same behavior out of the box
  • internally, this works by calling the purge method on the storage client, which means both the memory storage and the Apify client need to implement this purge method

Related: apify/apify-cli#545
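
A rough sketch of the purge-on-first-call behavior described above; the helper name is made up, while the CRAWLEE_PURGE_ON_START variable and the purge method on the storage client come from the description:

import os

_purge_done = False


async def _purge_default_storages_if_needed(storage_client) -> None:
    """Illustrative sketch: purge the default storages exactly once per run,
    unless disabled via the CRAWLEE_PURGE_ON_START environment variable."""
    global _purge_done
    if _purge_done:
        return
    _purge_done = True
    if os.getenv('CRAWLEE_PURGE_ON_START', 'true').lower() in ('0', 'false'):
        return
    await storage_client.purge()  # both memory storage and the Apify client implement this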

BasicCrawler statistics

  • Statistics shall be collected during the crawler run
  • BasicCrawler.run should return a (non-empty) statistics object
  • statistics should be logged periodically
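
A minimal sketch of what the returned statistics object could contain; the field names are illustrative assumptions, not a final model:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BasicCrawlerStatistics:
    """Illustrative statistics collected during a crawler run."""

    requests_finished: int = 0
    requests_failed: int = 0
    retry_histogram: dict[int, int] = field(default_factory=dict)
    started_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))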

Implement fingerprinting

Coordinate with @barjin before implementing anything.

There is a possibility of developing a dedicated fingerprinting library (in Rust?). In that case, we would only do some wrapping in Python tooling (and the same in JavaScript).

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

Detected dependencies

github-actions
.github/workflows/_check_changelog_entry.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_check_docs_build.yaml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/setup-node v4
.github/workflows/_check_version_conflict.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_linting.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_publish_to_pypi.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_type_checking.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_unit_tests.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/docs.yml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/configure-pages v5
  • actions/upload-pages-artifact v3
  • actions/deploy-pages v4
.github/workflows/run_release.yaml
.github/workflows/update_new_issue.yaml
  • actions/github-script v7
npm
website/package.json
  • @apify/utilities ^2.8.0
  • @docusaurus/core 3.4.0
  • @docusaurus/mdx-loader 3.4.0
  • @docusaurus/plugin-client-redirects 3.4.0
  • @docusaurus/preset-classic 3.4.0
  • @giscus/react ^3.0.0
  • @mdx-js/react ^3.0.1
  • axios ^1.5.0
  • buffer ^6.0.3
  • clsx ^2.0.0
  • crypto-browserify ^3.12.0
  • docusaurus-gtm-plugin ^0.0.2
  • docusaurus-plugin-typedoc-api ^4.2.0
  • prism-react-renderer ^2.1.0
  • process ^0.11.10
  • prop-types ^15.8.1
  • raw-loader ^4.0.2
  • react ^18.2.0
  • react-dom ^18.2.0
  • react-lite-youtube-embed ^2.3.52
  • stream-browserify ^3.0.0
  • unist-util-visit ^5.0.0
  • @apify/eslint-config-ts ^0.4.0
  • @apify/tsconfig ^0.1.0
  • @docusaurus/module-type-aliases 3.4.0
  • @docusaurus/types 3.4.0
  • @types/react ^18.0.28
  • @typescript-eslint/eslint-plugin ^7.0.0
  • @typescript-eslint/parser ^7.0.0
  • eslint ^8.35.0
  • eslint-plugin-react ^7.32.2
  • eslint-plugin-react-hooks ^4.6.0
  • fs-extra ^11.1.0
  • patch-package ^8.0.0
  • path-browserify ^1.0.1
  • prettier ^3.0.0
  • rimraf ^5.0.0
  • typescript 5.5.2
  • yarn 4.3.1
website/roa-loader/package.json
  • loader-utils ^3.2.1
pep621
pyproject.toml
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
poetry
pyproject.toml
  • python ^3.9
  • aiofiles ^23.2.1
  • aioshutil ^1.3
  • beautifulsoup4 ^4.12.3
  • colorama ^0.4.6
  • docutils ^0.21.0
  • eval-type-backport ^0.2.0
  • html5lib ^1.1
  • httpx ^0.27.0
  • lxml ^5.2.1
  • more_itertools ^10.2.0
  • playwright ^1.43.0
  • psutil ^6.0.0
  • pydantic ^2.6.3
  • pydantic-settings ^2.2.1
  • pyee ^11.1.0
  • python-dateutil ^2.9.0
  • sortedcollections ^2.1.0
  • typing-extensions ^4.1.0
  • tldextract ^5.1.2
  • cookiecutter ^2.6.0
  • typer ^0.12.3
  • inquirer ^3.3.0
  • build ~1.2.0
  • filelock ~3.15.0
  • ipdb ^0.13.13
  • mypy ~1.10.0
  • pre-commit ~3.7.0
  • pydoc-markdown ~4.8.2
  • pytest ~8.2.0
  • pytest-asyncio ~0.23.5
  • pytest-cov ~5.0.0
  • pytest-only ~2.1.0
  • pytest-timeout ~2.3.0
  • pytest-xdist ~3.6.0
  • respx ~0.21.0
  • ruff ~0.5.0
  • setuptools ^70.0.0
  • proxy-py ^2.4.4
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *

  • Check this box to trigger a request for Renovate to run again on this repository

Request queue v2 support

  • Implement methods for request queue v2 (locking, batch operations)
  • Implement request queue v2 support in the local request queue (locking, batch operations)
  • On top of that, there is a difference between the Python and JS clients: we are missing parallelism and retries in the Python client, so we need to implement them in the SDK

Explore what doc tooling we use in SDK and how it deals with dataclasses docstrings

Let's consider the following example:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemorySnapshot:
    """A snapshot of memory usage.

    Args:
        total_bytes: Total memory available in the system.
        current_bytes: Memory usage of the current Python process and its children.
        max_memory_bytes: The maximum memory that can be used by `AutoscaledPool`.
        max_used_memory_ratio: The maximum acceptable ratio of `current_bytes` to `max_memory_bytes`.
        created_at: The time at which the measurement was taken.
    """

    total_bytes: int
    current_bytes: int
    max_memory_bytes: int
    max_used_memory_ratio: float
    created_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))

    @property
    def is_overloaded(self) -> bool:
        """Returns whether the memory is considered as overloaded."""
        return (self.current_bytes / self.max_memory_bytes) > self.max_used_memory_ratio

Is the doc tooling (perhaps the one we use in the SDK) able to handle it properly?

Based on the discussion here: #20 (comment).

Separate `MemoryStorageClient` and `FilesystemStorageClient`

Description

Currently, we have a MemoryStorageClient that can also persist data to the file system.

Let's separate them; FilesystemStorageClient could probably extend MemoryStorageClient.

Other related things

  • There are memory storage-only data models in the storage/models.py module. Move them to the memory storage subpackage.
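
A rough sketch of the proposed split; the method names are illustrative assumptions:

from pathlib import Path


class MemoryStorageClient:
    """Keeps records only in memory (illustrative)."""

    def __init__(self) -> None:
        self._records: dict[str, bytes] = {}

    async def set_record(self, key: str, value: bytes) -> None:
        self._records[key] = value


class FilesystemStorageClient(MemoryStorageClient):
    """Extends the in-memory client by also persisting records to disk (illustrative)."""

    def __init__(self, storage_dir: Path) -> None:
        super().__init__()
        self._storage_dir = storage_dir

    async def set_record(self, key: str, value: bytes) -> None:
        await super().set_record(key, value)
        self._storage_dir.mkdir(parents=True, exist_ok=True)
        (self._storage_dir / key).write_bytes(value)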

BasicCrawler status logging

  • configurable interval
  • configurable status message callback (constructor parameter, property or decorator?)
  • we periodically set the crawler status via the storage client
  • in JavaScript Crawlee, this does nothing when MemoryStorage is being used

Simplify argument type `requests`

In some places, we use the following:

requests: list[BaseRequestData | Request]

Let's refactor the code to accept only one type.

In the places where we need to use:

arg_name: list[Request | str]

Let's use a different identifier than requests, e.g. sources.

See the following conversation for context - #56 (comment).

Remove `json_` and `order_no` from `Request`

The purpose of the fields is somewhat unclear, but it's certain that they don't belong to the Request class.

We should definitely explore the notion of an internal request in Crawlee and how it translates to the Python version.

Add `enqueue_links` helper

We should provide a helper similar to the one we have in JS Crawlee.

https://crawlee.dev/api/core/function/enqueueLinks

In a nutshell, there is a base implementation that takes a list of URLs, filters them based on the provided options (e.g. globs/regexps or the enqueue strategies), and adds them to the RQ. Then we have contextual helpers in each crawler; e.g. CheerioCrawler has its own context-aware variant, which operates on the current page and automatically finds all the links (matching the selector option, which defaults to just a).

The enqueuing strategies are described here:

https://crawlee.dev/api/core/enum/EnqueueStrategy

We should first come up with basic support for autoscaling and have BasicCrawler and BeautifulSoupCrawler classes.

We could start with a simple variant that will only work with regexps, and add more features/options going forward.
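
A minimal sketch of the regexp-only first version mentioned above, operating on an already parsed BeautifulSoup page; the signature is an assumption, not the final API:

from __future__ import annotations

import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


async def enqueue_links(
    soup: BeautifulSoup,
    base_url: str,
    request_queue,
    *,
    selector: str = 'a',
    include: list[str] | None = None,
) -> None:
    # Find candidate links, resolve them against the current page URL,
    # filter them by the provided regexps, and add them to the request queue.
    for link in soup.select(selector):
        href = link.get('href')
        if not href:
            continue
        url = urljoin(base_url, href)
        if not url.startswith(('http://', 'https://')):
            continue
        if include and not any(re.search(pattern, url) for pattern in include):
            continue
        await request_queue.add_request({'url': url})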

Refactor initialization of storages

Description

  • Currently, if you want to initialize a Dataset/KVS/RQ, you should use the open() constructor, and the call chain goes like the following:
    • dataset.open()
    • base_storage.open()
    • dataset.__init__()
    • base_storage.__init__()
  • In base_storage.open(), a specific client is selected (local MemoryStorageClient or cloud ApifyClient) using the StorageClientManager.
  • Refactor initialization of memory storage resource clients as well.

Desired state

  • Make it more readable, less error-prone (e.g. when a user uses the wrong constructor), and extensible by supporting other clients.
