
apify / crawlee-python


Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Home Page: https://crawlee.dev/python/

License: Apache License 2.0

Languages: Makefile 0.21%, Python 86.34%, JavaScript 10.53%, CSS 2.71%, Shell 0.20%

Topics: apify, automation, beautifulsoup, crawler, crawling, headless, headless-chrome, pip, playwright, python

crawlee-python's People

Contributors

asymness, b4nan, barjin, eltociear, fauzaanu, janbuchar, kpcofgs, mantisus, renovate[bot], siddiqkaithodu, souravjain540, tymeek, vdusek

crawlee-python's Issues

Configure Renovate bot

Configure Renovate to keep the Python dependencies up to date (the Poetry lock file, dev dependencies in pyproject.toml, ...), the same way we use it in our JS/TS projects.

I expect the Renovate bot to update dependencies at regular intervals. If the updates pass the tests, it will commit the changes directly to master; otherwise, it will open a pull request.

Once this is done, please open the same issue for SDK, Client, and Shared Python repositories with a link to this one.

Blocked by #6.

Kick off the AutoscaledPool

Tasks

  • Explore the Apify/Crawlee AutoscaledPool in JavaScript (https://crawlee.dev/api/core/class/AutoscaledPool).
  • Figure out how to implement similar functionality in Python.
  • Prepare a PoC of the AutoscaledPool for the Python SDK.
  • Measure the performance of the PoCs.

Test Actor for PoCs:

import asyncio
from dataclasses import dataclass, field
from time import time
from typing import Callable
from urllib.parse import urljoin

from apify import Actor
from apify.storages import RequestQueue
from bs4 import BeautifulSoup, Tag
from httpx import AsyncClient


@dataclass(frozen=True)
class ActorInput:
    start_urls: list[dict] = field(default_factory=lambda: [{'url': 'https://apify.com'}])
    max_depth: int = 1
    desired_concurrency: int = 10


class BeautifulSoupCrawler:
    def __init__(self, handle_request: Callable, max_depth: int, desired_concurrency: int) -> None:
        self.handle_request = handle_request
        self.max_depth = max_depth
        self.desired_concurrency = desired_concurrency

    async def run(self, start_urls: list) -> None:
        # TODO: every PoC will implement this differently.
        raise NotImplementedError


async def handle_request(request: dict, request_queue: RequestQueue, max_depth: int) -> None:
    url = request['url']
    depth = request['userData']['depth']
    Actor.log.info(f'Scraping {url} (depth={depth}) ...')

    try:
        async with AsyncClient() as client:
            response = await client.get(url, follow_redirects=True)
        soup = BeautifulSoup(response.content, 'html.parser')

        # If we haven't reached the max depth, look for nested links and enqueue their targets
        if depth < max_depth:
            for link in soup.find_all('a'):
                link_href = link.get('href')
                link_url = urljoin(url, link_href)
                if link_url.startswith(('http://', 'https://')):
                    Actor.log.info(f'Enqueuing {link_url} ...')
                    await request_queue.add_request(
                        {
                            'url': link_url,
                            'userData': {'depth': depth + 1},
                        }
                    )

        result = {
            'url': url,
            'title': soup.title.string if isinstance(soup.title, Tag) else None,
        }
        await Actor.push_data(result)

    except Exception:
        Actor.log.exception(f'Cannot extract data from {url}.')
    finally:
        # Mark the request as handled so it's not processed again
        await request_queue.mark_request_as_handled(request)


async def main() -> None:
    async with Actor:
        actor_input = ActorInput(**(await Actor.get_input() or {}))
        crawler = BeautifulSoupCrawler(handle_request, actor_input.max_depth, actor_input.desired_concurrency)
        start = time()
        await crawler.run(actor_input.start_urls)
        elapsed_time = time() - start
        Actor.log.info(f'Time taken: {elapsed_time}')
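
For orientation, below is a minimal sketch of a fixed-concurrency pool built on asyncio primitives, loosely mirroring the runTaskFunction / isFinishedFunction options of the JS AutoscaledPool. It only illustrates the core run/worker loop; the real AutoscaledPool additionally adjusts the concurrency based on CPU, memory, and event-loop snapshots.

import asyncio
from collections.abc import Awaitable, Callable


class FixedConcurrencyPool:
    """Illustrative only: run tasks with a fixed concurrency limit.

    The real AutoscaledPool also scales the limit up and down based on system load.
    """

    def __init__(
        self,
        run_task: Callable[[], Awaitable[None]],
        is_finished: Callable[[], Awaitable[bool]],
        desired_concurrency: int,
    ) -> None:
        self._run_task = run_task
        self._is_finished = is_finished
        self._semaphore = asyncio.Semaphore(desired_concurrency)

    async def _worker(self) -> None:
        try:
            await self._run_task()
        finally:
            self._semaphore.release()

    async def run(self) -> None:
        tasks: set[asyncio.Task] = set()
        while not await self._is_finished():
            await self._semaphore.acquire()  # wait for a free concurrency slot
            task = asyncio.create_task(self._worker())
            tasks.add(task)
            task.add_done_callback(tasks.discard)
        await asyncio.gather(*tasks)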

Add package for run-time type checking

Description

In the PR apify/apify-sdk-python#171, @janbuchar suggested using some run-time type checking for Python.

For example typeguard: it can be applied either via the @typechecked decorator for a specific function, or via the import hook typeguard.install_import_hook() for a whole module.

For methods/functions where we currently check the types of arguments and return values manually, it could make sense to use it, e.g. here: https://github.com/apify/apify-sdk-python/blob/v1.5.1/src/apify/scrapy/utils.py#L44.
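
For illustration, a minimal decorator-based example (a sketch; the exact exception type depends on the typeguard version):

from typeguard import typechecked


@typechecked
def to_positive_int(value: int) -> int:
    # Both the argument and the return value are validated against the
    # annotations at run time.
    return abs(value)


to_positive_int(5)     # OK
to_positive_int('5')   # raises a run-time type check error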

Potential problems

I suppose it is implemented using typing.get_type_hints to obtain the type hints of a specific function. I ran into a bug when typing.get_type_hints and from __future__ import annotations are used together; see apify/apify-sdk-python#151. However, tests should reveal any such problem.

Passing context to crawler handlers

The problem

We want to pick the best approach for passing context data/helpers to the various handler functions in crawlee-python. We already have an implementation in place, but if there's a better way, we should switch to it sooner rather than later.

What OG Crawlee does

new CheerioCrawler({
  requestHandler: async ({ request, pushData, enqueueLinks }) => { // types of helpers are inferred correctly - thanks typescript
    // ...
  }
})
  • types are correct
  • we get suggestions for context helpers
  • implementation is iffy from a type-safety perspective, but salvageable

Python version A (+/- current implementation)

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:  # explicit type annotation is necessary for type checking and suggestions
  context.push_data(...)
  • if a type checker and annotations are used, types are correct (can't get better than that in Python)
  • we get suggestions for context helpers
  • the implementation is type-safe enough, but very inflexible
    • in contrast to TypeScript, it won't be salvageable anytime soon - not until we have intersection types
  • there is no "object destructuring" in Python, so everything needs to be prefixed with context.

Python version B

This proposal is similar to how pytest fixtures or FastAPI dependencies work.

crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(push_data: PushData, soup: BeautifulSoup) -> None:  # explicit type annotation is necessary for type checking
  push_data(...)
  • no context. prefix
  • the function signature is not checked by a type checker, but we can do it when the handler is registered, which should be fine as well
  • allows for a more flexible implementation with easier code reuse
  • no suggestions of parameter names
  • the "injection" from Crawlee's side can be based on both parameter name and type annotation, so the type annotations are optional for users (but if they don't use it, they miss out on type safety and autocompletions)
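
For illustration, version B could be driven by signature inspection on Crawlee's side. The sketch below only shows the injection step; call_handler and the available mapping are hypothetical names, not an existing API:

import inspect
from typing import Any, Callable


async def call_handler(handler: Callable, available: dict[str, Any]) -> None:
    # Inspect the registered handler and pass in only the context helpers
    # it actually asks for, matched by parameter name.
    signature = inspect.signature(handler)
    kwargs = {}
    for name in signature.parameters:
        if name not in available:
            raise TypeError(f'Cannot inject unknown handler parameter {name!r}')
        kwargs[name] = available[name]
    await handler(**kwargs)

Matching by type annotation instead of (or in addition to) parameter name would follow the same pattern, reading signature.parameters[name].annotation in the same loop.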

Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.

Better approach of making a cache

  • This is a follow-up issue to the discussion in #82 (comment).
  • Currently, we have our own implementation of an LRU cache in crawlee/_utils/lru_cache.py.
  • Let's do it in a more Pythonic way, maybe utilizing the built-in caching from the functools standard module (the lru_cache decorator)? See the sketch below.
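
A minimal sketch of the two obvious directions; the names are placeholders, and neither is meant as the final implementation:

from collections import OrderedDict
from functools import lru_cache


# If the cached value is the result of a pure function, lru_cache is enough.
@lru_cache(maxsize=1024)
def normalized_hostname(url: str) -> str:
    return url.split('//', 1)[-1].split('/', 1)[0].lower()


# If a dict-like interface is still required, OrderedDict keeps the LRU logic short.
class LRUDict(OrderedDict):
    def __init__(self, max_length: int) -> None:
        super().__init__()
        self._max_length = max_length

    def __setitem__(self, key, value) -> None:
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self._max_length:
            self.popitem(last=False)

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # accessing an item marks it as most recently used
        return value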

Introduce a better solution for dealing with byte size

Current state

  • Currently, we have many variables describing a "byte size" as an integer. This leads to identifiers with _bytes suffixes, e.g. max_memory_bytes, buffer_memory_bytes, threshold_memory_bytes, ...
  • Then we have to use conversion functions, for example when we want to log such a value (e.g. a to_mb function).

Goal state

  • Use a cleverer way of dealing with these kinds of variables, similar to how we utilize datetime.timedelta for durations.
  • Either by implementing our own solution, e.g. like this:
# src/crawlee/_utils/byte_size.py

from __future__ import annotations

from dataclasses import dataclass

_BYTES_PER_KB = 1024
_BYTES_PER_MB = _BYTES_PER_KB**2
_BYTES_PER_GB = _BYTES_PER_KB**3
_BYTES_PER_TB = _BYTES_PER_KB**4

@dataclass
class ByteSize:
    """Represents a size in bytes."""

    bytes_: int

    def to_kb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_KB

    def to_mb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_MB

    def to_gb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_GB

    def to_tb(self: ByteSize) -> float:
        return self.bytes_ / _BYTES_PER_TB

    def __str__(self: ByteSize) -> str:
        if self.bytes_ >= _BYTES_PER_TB:
            return f'{self.to_tb():.2f} TB'

        if self.bytes_ >= _BYTES_PER_GB:
            return f'{self.to_gb():.2f} GB'

        if self.bytes_ >= _BYTES_PER_MB:
            return f'{self.to_mb():.2f} MB'

        if self.bytes_ >= _BYTES_PER_KB:
            return f'{self.to_kb():.2f} KB'

        return f'{self.bytes_} Bytes'
  • Or use some existing solution. Explore the following:
    • typing.NewType;
    • packages on PyPI.
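
A quick usage check of the sketched class above (purely illustrative):

# Continuing the sketch above (same module):
max_memory = ByteSize(bytes_=3 * _BYTES_PER_GB + 512 * _BYTES_PER_MB)
print(max_memory.to_gb())  # 3.5
print(max_memory)          # 3.50 GB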

System status reports "Total weight cannot be zero"

  • Crawlee v0.0.2
  • Script:
import asyncio
import logging
from bs4 import BeautifulSoup
from crawlee.log_config import CrawleeLogFormatter
from crawlee.http_crawler import HttpCrawler
from crawlee.http_crawler.types import HttpCrawlingContext
from crawlee.storages import Dataset, RequestList

logger = logging.getLogger()
handler = logging.StreamHandler()
handler.setFormatter(fmt=CrawleeLogFormatter())
logger.addHandler(hdlr=handler)
logger.setLevel(logging.DEBUG)


async def main() -> None:
    request_list = RequestList(['https://crawlee.dev'])
    crawler = HttpCrawler(request_provider=request_list)
    dataset = await Dataset.open()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        status_code = context.http_response.status_code
        soup = BeautifulSoup(context.http_response.read(), 'lxml')
        title = soup.find('title').text
        result = {'url': context.http_response.url, 'title': title, 'status_code': status_code}
        print(f'Got the result: {result}, gonna push it to the dataset.')
        await dataset.push_data(result)

    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
  • Resulting in
[asyncio] DEBUG Using selector: EpollSelector
[httpx] DEBUG load_ssl_context verify=True cert=None trust_env=True http2=False
[httpx] DEBUG load_verify_locations cafile='/home/vdusek/Projects/crawlee-py/.venv/lib/python3.12/site-packages/certifi/cacert.pem'
[crawlee.autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 3.84 GB.
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_snapshot_event_loop, delay=0:00:00.500000)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_snapshot_client, delay=0:00:01)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_log_system_status, delay=0:01:00)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_autoscale, delay=0:00:10)...
[crawlee._utils.recurring_task] DEBUG Calling RecurringTask.__init__(func=_emit_system_info_event, delay=0:01:00)...
[crawlee.autoscaling.autoscaled_pool] DEBUG Starting the pool
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0; client_info = 0
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Scheduling a new task
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - already running at desired concurrency
[crawlee.autoscaling.autoscaled_pool] DEBUG Worker task finished
[httpcore.connection] DEBUG connect_tcp.started host='crawlee.dev' port=443 local_address=None timeout=5.0 socket_options=None
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.system_status] WARN  Total weight cannot be zero
[crawlee.autoscaling.autoscaled_pool] DEBUG Not scheduling new task - no task is ready
[httpcore.connection] DEBUG connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f68ede856d0>
[httpcore.connection] DEBUG start_tls.started ssl_context=<ssl.SSLContext object at 0x7f68ede482d0> server_hostname='crawlee.dev' timeout=5.0
[httpcore.connection] DEBUG start_tls.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f68edd41550>
[httpcore.http11] DEBUG send_request_headers.started request=<Request [b'GET']>
[httpcore.http11] DEBUG send_request_headers.complete
[httpcore.http11] DEBUG send_request_body.started request=<Request [b'GET']>
[httpcore.http11] DEBUG send_request_body.complete
[httpcore.http11] DEBUG receive_response_headers.started request=<Request [b'GET']>
[httpcore.http11] DEBUG receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Connection', b'keep-alive'), (b'Content-Length', b'15939'), (b'Server', b'GitHub.com'), (b'Content-Type', b'text/html; charset=utf-8'), (b'Last-Modified', b'Thu, 11 Apr 2024 09:11:37 GMT'), (b'Access-Control-Allow-Origin', b'*'), (b'Strict-Transport-Security', b'max-age=31556952'), (b'ETag', b'W/"6617a949-11d3c"'), (b'expires', b'Thu, 11 Apr 2024 09:29:14 GMT'), (b'Cache-Control', b'max-age=600'), (b'Content-Encoding', b'gzip'), (b'x-proxy-cache', b'MISS'), (b'X-GitHub-Request-Id', b'79F2:30F74F:7BBC900:7DB5193:6617AB12'), (b'Accept-Ranges', b'bytes'), (b'Date', b'Thu, 11 Apr 2024 12:52:21 GMT'), (b'Via', b'1.1 varnish'), (b'Age', b'149'), (b'X-Served-By', b'cache-fra-etou8220138-FRA'), (b'X-Cache', b'HIT'), (b'X-Cache-Hits', b'1'), (b'X-Timer', b'S1712839941.239169,VS0,VE1'), (b'Vary', b'Accept-Encoding'), (b'X-Fastly-Request-ID', b'cef204eb5ba20a84be8334407996f7874dd39c5a')])
[httpx] INFO  HTTP Request: GET https://crawlee.dev "HTTP/1.1 200 OK"
[httpcore.http11] DEBUG receive_response_body.started request=<Request [b'GET']>
[httpcore.http11] DEBUG receive_response_body.complete
[httpcore.http11] DEBUG response_closed.started
[httpcore.http11] DEBUG response_closed.complete
Got the result: {'url': URL('https://crawlee.dev'), 'title': 'Crawlee · Build reliable crawlers. Fast. | Crawlee', 'status_code': 200}, gonna push it to the dataset.
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee.autoscaling.autoscaled_pool] DEBUG Worker task finished
[crawlee.autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee.autoscaling.autoscaled_pool] DEBUG Terminating - no running tasks to wait for
[crawlee.autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
  • There is probably some issue in the Snapshotter / SystemStatus - investigate & fix it.
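
For context, the warning text reads like a guard in a weighted-average computation that receives an empty (or all-zero-weight) sample set, which would be consistent with the system status being evaluated before any snapshots were collected. A purely hypothetical illustration of such a guard (not the actual crawlee code):

def weighted_avg(values: list[float], weights: list[float]) -> float:
    # Hypothetical illustration: with no samples, the total weight is zero
    # and there is nothing meaningful to average.
    total = sum(weights)
    if total == 0:
        raise ValueError('Total weight cannot be zero')
    return sum(v * w for v, w in zip(values, weights)) / total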

Use `uv` as the packaging tool in CI builds

Recently, the creators of Ruff (Astral) released a new package installer and resolver called uv, written in Rust. Perhaps we could integrate it into our CI pipelines, as installing everything for all supported Python versions, as well as on Linux and Windows, can take some time.

This week, a similar approach was implemented in Apache Airflow: apache/airflow#37692.

Improve unit testing of Snapshotter

We're touching a lot of private internals there; let's do it in a better way.

We discussed it in discussion_r1521267138.

Mainly, the following suggestion is a good one:

"Or we could make a testing implementation of EventManager where emitting events could be done from the outside (I mean from the test)."
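
A rough sketch of that idea; the class and method names below are hypothetical, not the existing EventManager API. The point is that the test itself emits synthetic system-info events, so no private attributes of the Snapshotter need to be touched:

from collections import defaultdict
from typing import Any, Callable


class TestingEventManager:
    """Hypothetical test double: events are emitted directly from the test."""

    def __init__(self) -> None:
        self._listeners: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def on(self, event: str, listener: Callable[[Any], None]) -> None:
        self._listeners[event].append(listener)

    def emit(self, event: str, payload: Any) -> None:
        # Called from the test to simulate e.g. a system-info event.
        for listener in self._listeners[event]:
            listener(payload)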

Add base storage client and resource subclients

Description

Currently, our resource clients are memory-storage specific. Let's update them to be storage-agnostic. That will probably require updating the BaseStorageClient and MemoryStorageClient as well.

However, fully storage-agnostic resource clients are not an option given the structure of the Apify (platform) clients. So instead, let's implement a unified interface (abstract base classes) for BaseStorage and all resource sub-clients (it will be based on the ApifyClient). All of the specific storage clients should inherit from the base classes and implement the relevant methods.

Soon we will have MemoryStorageClient, FileSystemStorageClient (probably extending the MemoryStorageClient), and ApifyStorageClient (implemented in the apify-sdk or in apify-client). All of them should implement the BaseStorageClient interface.

StorageClientManager will take care of setting the specific StorageClient.
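
A minimal sketch of what such a unified interface could look like; all class and method names below are illustrative assumptions modeled on the description above, not the final design:

from abc import ABC, abstractmethod


class BaseDatasetClient(ABC):
    """Interface of a single dataset resource sub-client (illustrative)."""

    @abstractmethod
    async def push_items(self, items: list[dict]) -> None: ...


class BaseStorageClient(ABC):
    """Interface every storage backend implements (illustrative)."""

    @abstractmethod
    def dataset(self, id: str) -> BaseDatasetClient: ...


class MemoryDatasetClient(BaseDatasetClient):
    def __init__(self) -> None:
        self.items: list[dict] = []

    async def push_items(self, items: list[dict]) -> None:
        self.items.extend(items)


class MemoryStorageClient(BaseStorageClient):
    def __init__(self) -> None:
        self._datasets: dict[str, MemoryDatasetClient] = {}

    def dataset(self, id: str) -> MemoryDatasetClient:
        if id not in self._datasets:
            self._datasets[id] = MemoryDatasetClient()
        return self._datasets[id]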

Implement auto-purging of storages

We need the same behavior as in the JS version:

  • crawlee implements the base storage classes
  • every async operation checks whether it is the first call and purges the storages automatically, unless opted out via the CRAWLEE_PURGE_ON_START env var (set to a falsy value like 0 or false)
  • we have this method, which is called in many places in the storage methods like open or getInput: https://crawlee.dev/api/core/function/purgeDefaultStorages
  • since the SDK uses those storage classes, it gets the same behavior out of the box
  • internally, this works by calling the purge method on the storage client, which means both the memory storage and the Apify client need to implement this purge method

Related: apify/apify-cli#545
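
A rough sketch of the purge-on-first-call behavior described above; the helper name is made up, while the CRAWLEE_PURGE_ON_START variable and the purge method on the storage client come from the description:

import os

_purge_done = False


async def _purge_default_storages_if_needed(storage_client) -> None:
    """Illustrative sketch: purge the default storages exactly once per run,
    unless disabled via the CRAWLEE_PURGE_ON_START environment variable."""
    global _purge_done
    if _purge_done:
        return
    _purge_done = True
    if os.getenv('CRAWLEE_PURGE_ON_START', 'true').lower() in ('0', 'false'):
        return
    await storage_client.purge()  # both memory storage and the Apify client implement this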

BasicCrawler statistics

  • Statistics shall be collected during the crawler run
  • BasicCrawler.run should return a (non-empty) statistics object
  • statistics should be logged periodically
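
A minimal sketch of what the returned statistics object could contain; the field names are illustrative assumptions, not a final model:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BasicCrawlerStatistics:
    """Illustrative statistics collected during a crawler run."""

    requests_finished: int = 0
    requests_failed: int = 0
    retry_histogram: dict[int, int] = field(default_factory=dict)
    started_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))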

Implement fingerprinting

Coordinate with @barjin before implementing anything.

There is a possibility of developing a dedicated fingerprinting library (in Rust?). In that case, we would only do some wrapping in Python tooling (and the same in JavaScript).

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

Detected dependencies

github-actions
.github/workflows/_check_changelog_entry.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_check_docs_build.yaml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/setup-node v4
.github/workflows/_check_version_conflict.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_linting.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_publish_to_pypi.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_type_checking.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/_unit_tests.yaml
  • actions/checkout v4
  • actions/setup-python v5
.github/workflows/docs.yml
  • actions/checkout v4
  • actions/setup-node v4
  • actions/configure-pages v5
  • actions/upload-pages-artifact v3
  • actions/deploy-pages v4
.github/workflows/run_release.yaml
.github/workflows/update_new_issue.yaml
  • actions/github-script v7
npm
website/package.json
  • @apify/utilities ^2.8.0
  • @docusaurus/core 3.4.0
  • @docusaurus/mdx-loader 3.4.0
  • @docusaurus/plugin-client-redirects 3.4.0
  • @docusaurus/preset-classic 3.4.0
  • @giscus/react ^3.0.0
  • @mdx-js/react ^3.0.1
  • axios ^1.5.0
  • buffer ^6.0.3
  • clsx ^2.0.0
  • crypto-browserify ^3.12.0
  • docusaurus-gtm-plugin ^0.0.2
  • docusaurus-plugin-typedoc-api ^4.2.0
  • prism-react-renderer ^2.1.0
  • process ^0.11.10
  • prop-types ^15.8.1
  • raw-loader ^4.0.2
  • react ^18.2.0
  • react-dom ^18.2.0
  • react-lite-youtube-embed ^2.3.52
  • stream-browserify ^3.0.0
  • unist-util-visit ^5.0.0
  • @apify/eslint-config-ts ^0.4.0
  • @apify/tsconfig ^0.1.0
  • @docusaurus/module-type-aliases 3.4.0
  • @docusaurus/types 3.4.0
  • @types/react ^18.0.28
  • @typescript-eslint/eslint-plugin ^7.0.0
  • @typescript-eslint/parser ^7.0.0
  • eslint ^8.35.0
  • eslint-plugin-react ^7.32.2
  • eslint-plugin-react-hooks ^4.6.0
  • fs-extra ^11.1.0
  • patch-package ^8.0.0
  • path-browserify ^1.0.1
  • prettier ^3.0.0
  • rimraf ^5.0.0
  • typescript 5.5.2
  • yarn 4.3.1
website/roa-loader/package.json
  • loader-utils ^3.2.1
pep621
pyproject.toml
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
poetry
pyproject.toml
  • python ^3.9
  • aiofiles ^23.2.1
  • aioshutil ^1.3
  • beautifulsoup4 ^4.12.3
  • colorama ^0.4.6
  • docutils ^0.21.0
  • eval-type-backport ^0.2.0
  • html5lib ^1.1
  • httpx ^0.27.0
  • lxml ^5.2.1
  • more_itertools ^10.2.0
  • playwright ^1.43.0
  • psutil ^6.0.0
  • pydantic ^2.6.3
  • pydantic-settings ^2.2.1
  • pyee ^11.1.0
  • python-dateutil ^2.9.0
  • sortedcollections ^2.1.0
  • typing-extensions ^4.1.0
  • tldextract ^5.1.2
  • cookiecutter ^2.6.0
  • typer ^0.12.3
  • inquirer ^3.3.0
  • build ~1.2.0
  • filelock ~3.15.0
  • ipdb ^0.13.13
  • mypy ~1.10.0
  • pre-commit ~3.7.0
  • pydoc-markdown ~4.8.2
  • pytest ~8.2.0
  • pytest-asyncio ~0.23.5
  • pytest-cov ~5.0.0
  • pytest-only ~2.1.0
  • pytest-timeout ~2.3.0
  • pytest-xdist ~3.6.0
  • respx ~0.21.0
  • ruff ~0.5.0
  • setuptools ^70.0.0
  • proxy-py ^2.4.4
templates/beautifulsoup/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *
templates/playwright/{{cookiecutter.project_name}}/pyproject.toml
  • python ^3.9
  • crawlee *

  • Check this box to trigger a request for Renovate to run again on this repository

Request queue v2 support

  • Implement methods for request queue v2 (locking, batch operations)
  • Implement request queue v2 support in the local request queue (locking, batch operations)
  • On top of that, there is a difference between the Python and JS clients: we are missing parallelism and retries in the Python client, so we need to implement them in the SDK

Explore what doc tooling we use in SDK and how it deals with dataclasses docstrings

Let's consider the following example:

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemorySnapshot:
    """A snapshot of memory usage.

    Args:
        total_bytes: Total memory available in the system.
        current_bytes: Memory usage of the current Python process and its children.
        max_memory_bytes: The maximum memory that can be used by `AutoscaledPool`.
        max_used_memory_ratio: The maximum acceptable ratio of `current_bytes` to `max_memory_bytes`.
        created_at: The time at which the measurement was taken.
    """

    total_bytes: int
    current_bytes: int
    max_memory_bytes: int
    max_used_memory_ratio: float
    created_at: datetime = field(default_factory=lambda: datetime.now(tz=timezone.utc))

    @property
    def is_overloaded(self) -> bool:
        """Returns whether the memory is considered as overloaded."""
        return (self.current_bytes / self.max_memory_bytes) > self.max_used_memory_ratio

Is the doc tooling (perhaps the one we use in the SDK) able to handle it properly?

Based on the discussion here: #20 (comment).

Separate `MemoryStorageClient` and `FilesystemStorageClient`

Description

Currently, we have a MemoryStorageClient that can also persist data to the file system.

Let's separate them; FilesystemStorageClient could probably extend MemoryStorageClient.

Other related things

  • There are memory storage-only data models in the storage/models.py module. Move them to the memory storage subpackage.
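
A rough sketch of the proposed split; the method names are illustrative assumptions:

from pathlib import Path


class MemoryStorageClient:
    """Keeps records only in memory (illustrative)."""

    def __init__(self) -> None:
        self._records: dict[str, bytes] = {}

    async def set_record(self, key: str, value: bytes) -> None:
        self._records[key] = value


class FilesystemStorageClient(MemoryStorageClient):
    """Extends the in-memory client by also persisting records to disk (illustrative)."""

    def __init__(self, storage_dir: Path) -> None:
        super().__init__()
        self._storage_dir = storage_dir

    async def set_record(self, key: str, value: bytes) -> None:
        await super().set_record(key, value)
        self._storage_dir.mkdir(parents=True, exist_ok=True)
        (self._storage_dir / key).write_bytes(value)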

BasicCrawler status logging

  • configurable interval
  • configurable status message callback (constructor parameter, property or decorator?)
  • we periodically set the crawler status via the storage client
  • in JavaScript Crawlee, this does nothing when MemoryStorage is being used

Simplify argument type `requests`

In some places, we use the following:

requests: list[BaseRequestData | Request]

Let's refactor the code to accept only one type.

In the places where we need to use:

arg_name: list[Request | str]

Let's use a different identifier than requests, e.g. sources.

See the following conversation for context - #56 (comment).

Remove `json_` and `order_no` from `Request`

The purpose of the fields is somewhat unclear, but it's certain that they don't belong to the Request class.

We should definitely explore the notion of an internal request in Crawlee and how it translates to the Python version.

Add `enqueue_links` helper

We should provide a helper similar to the one we have in JS Crawlee.

https://crawlee.dev/api/core/function/enqueueLinks

In a nutshell, there is a base implementation that takes a list of URLs, filters them based on the provided options (e.g. globs/regexps or the enqueue strategies), and adds them to the RQ. Then we have contextual helpers in each crawler; e.g. CheerioCrawler has its own context-aware variant, which operates on the current page and automatically finds all the links (matching the selector option, which defaults to just a).

The enqueuing strategies are described here:

https://crawlee.dev/api/core/enum/EnqueueStrategy

We should first come up with basic support for autoscaling and have BasicCrawler and BeautifulSoupCrawler classes.

We could start with a simple variant that will only work with regexps, and add more features/options going forward.
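
A minimal sketch of the regexp-only first version mentioned above, operating on an already parsed BeautifulSoup page; the signature is an assumption, not the final API:

from __future__ import annotations

import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


async def enqueue_links(
    soup: BeautifulSoup,
    base_url: str,
    request_queue,
    *,
    selector: str = 'a',
    include: list[str] | None = None,
) -> None:
    # Find candidate links, resolve them against the current page URL,
    # filter them by the provided regexps, and add them to the request queue.
    for link in soup.select(selector):
        href = link.get('href')
        if not href:
            continue
        url = urljoin(base_url, href)
        if not url.startswith(('http://', 'https://')):
            continue
        if include and not any(re.search(pattern, url) for pattern in include):
            continue
        await request_queue.add_request({'url': url})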

Refactor initialization of storages

Description

  • Currently, if you want to initialize a Dataset/KVS/RQ, you should use the open() constructor, and the call chain goes like the following:
    • dataset.open()
    • base_storage.open()
    • dataset.__init__()
    • base_storage.__init__()
  • In base_storage.open(), a specific client is selected (local MemoryStorageClient or cloud ApifyClient) using the StorageClientManager.
  • Refactor initialization of memory storage resource clients as well.

Desired state

  • Make it more readable, less error-prone (e.g. when a user uses the wrong constructor), and extensible by supporting other clients.
