Code Monkey home page Code Monkey logo

hibp-downloader's Introduction

hibp-downloader

pypi python build tests docs license

This is a CLI tool to efficiently download a local copy of the pwned password hash data from the very awesome HIBP pwned passwords api-endpoint using all the good bits; multiprocessing, async-processes, local-caching, content-etags and http2-connection pooling to probably make things as fast as is Pythonly possible.

Features

  • Interface to directly query for compromised password values from the compressed file data-store!
  • Download and store acquired data in gzip'd compressed to save on storage and speed up queries.
  • Download the full dataset in under 45 mins (generally CPU bound)
  • Easily resume interrupted download operations into a --data-path without re-clobbering api-source.
  • Only download hash-prefix content blocks when the source content has changed (via content ETAG values); making it easy to periodically sync-up when needed.
  • Query interface performance is efficient enough to attach a user web-service with reasonable loads (ie don't waste your own resources decompressing the dataset and storing in a database!)
  • Ability to generate a single text file with in-order pwned password hash values, similar to PwnedPasswordsDownloader from the awesome HIBP team.
  • Per prefix file metadata in JSON format for easy data reuse by other tooling if required.

Install

pipx install hibp-downloader

Usage (download)

screenshot-help.png

Performance

Sample download activity log; host with 32 cores on 500Mbit/s connection.

...
2024-05-16T10:18:01-0400 | INFO | hibp-downloader | prefix=f80c7 source=[lc:13616 et:3 rc:1002358 ro:25 xx:1] processed=[17836.6MB ~414462H/s] api=[918req/s 17597.4MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f81af source=[lc:13616 et:3 rc:1002558 ro:25 xx:1] processed=[17840.1MB ~414454H/s] api=[918req/s 17600.9MB] runtime=36.4min
2024-05-16T10:18:02-0400 | INFO | hibp-downloader | prefix=f826f source=[lc:13616 et:3 rc:1002758 ro:25 xx:1] processed=[17843.6MB ~414454H/s] api=[918req/s 17604.4MB] runtime=36.4min
2024-05-16T10:18:03-0400 | INFO | hibp-downloader | prefix=f833f source=[lc:13616 et:3 rc:1002958 ro:25 xx:1] processed=[17847.1MB ~414450H/s] api=[918req/s 17607.9MB] runtime=36.4min
  • 918x requests per second to api.pwnedpasswords.com
  • Log sources are shorthand:
    • lc: 13616 from local-cache (lc) - request-responses handled locally without hitting the network.
    • et: 3 etag-matched (et) - request-responses that confirmed our local data was up-to-date and did not require a new download.
    • rc: 1002958 from remote-cache (rc) - request-responses that were downloaded to local, but came from the remote-server cache.
    • ro: 25 from remote-origin (ro) - request-responses that were downloaded to local, and the download needed to be fetched from remote origin source.
    • xx: 1 failed responses - request-responses that failed (and successfully retried).
  • ~17GB downloaded in ~36 minutes (full dataset)
  • Approx ~414k hash values received per second
  • Processing in this example appears to be CPU bound, measured traffic around ~160 Mbit/s.

Usage (query)

screenshot-help.png

Project

Copyright

All rights reserved.

License

  • BSD-3-Clause - see LICENSE file for details.

hibp-downloader's People

Contributors

ndejong avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

sebras rillke took

hibp-downloader's Issues

Timeouts (short dns timeout value?)

I'm repeatedly getting the following errors running the downloader:

Process Process-11:
Traceback (most recent call last):
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/anyio/_core/_sockets.py", line 190, in connect_tcp
    addr_obj = ip_address(remote_host)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: 'api.pwnedpasswords.com' does not appear to be an IPv4 or IPv6 address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/anyio/_core/_tasks.py", line 115, in fail_after
    yield cancel_scope
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 114, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/anyio/_core/_sockets.py", line 193, in connect_tcp
    gai_res = await getaddrinfo(
              ^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fbb37227890

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 113, in connect_tcp
    with anyio.fail_after(timeout):
  File "/usr/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/anyio/_core/_tasks.py", line 118, in fail_after
    raise TimeoutError
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
    yield
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_transports/default.py", line 366, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 268, in handle_async_request
    raise exc
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 251, in handle_async_request
    response = await connection.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_async/connection.py", line 99, in handle_async_request
    raise exc
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_async/connection.py", line 76, in handle_async_request
    stream = await self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_async/connection.py", line 124, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_backends/auto.py", line 30, in connect_tcp
    return await self._backend.connect_tcp(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 112, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectTimeout

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/hibp_downloader/commands/hibp_download.py", line 178, in queue_worker_process
    asyncio.run(pwnedpasswords_get_store_gather(result_queue, hash_prefixes, worker_args))
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/hibp_downloader/commands/hibp_download.py", line 182, in pwnedpasswords_get_store_gather
    metadata_results = await asyncio.gather(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/hibp_downloader/commands/hibp_download.py", line 228, in pwnedpasswords_get_and_store_async
    binary, metadata_latest = await pwnedpasswords_get(prefix, hash_type=hash_type, etag=etag, encoding=encoding_type)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/hibp_downloader/commands/hibp_download.py", line 269, in pwnedpasswords_get
    response = await httpx_binary_response(url=url, etag=etag, encoding=encoding, debug=httpx_debug)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/hibp_downloader/lib/http.py", line 53, in httpx_binary_response
    response = await client.send(request=request, stream=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_transports/default.py", line 365, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/dam/.virtualenvs/pwneddb/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectTimeout

The thread managing also does not handle the threads crashing, making those threads stall. Eventually every thread crashes and the main process gets stuck waiting for the now crashed threads to close.

Option to choose compression ?

Hi,

I would like to run a local clone of HIBP Passwords API with data from hibp-downloader and I would like to have files with no compression to serve it directly. It is possible to have option to choose between gzip|br|none ?

Thx

incremental backup

Hi!
Can I download the updated data into separate files? For example, the original database contained 100k hashes, in a month I will want to update. Where will new hashes be saved?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.