
horde-worker-regen's People

Contributors

db0, dependabot[bot], fripe070, gabrieljanczak, rikudousage, stinkerwue, superintendent2521, tazlin, zten


horde-worker-regen's Issues

Low system RAM environments fail with SDXL (or other high RAM footprint models)

Of note, I have observed that SDXL seems to struggle on 13 GB/16 GB RAM setups, despite this appearing to work locally when I have had fewer resources (9 GB of system RAM free).

I was able to observe a single SDXL job finish and submit, but on the next job (also SDXL) the inference hung on the sampler step in ComfyUI.

I suspect this may be tied to the hardcoded attempt to keep 9 GB of system memory free.
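For context, a sketch of the kind of free-RAM floor described above; the constant and function names are illustrative, not the worker's actual code, and psutil is assumed to be installed:

import psutil

# Illustrative only: a threshold check like the hardcoded one described above.
FREE_RAM_FLOOR_BYTES = 9 * 1024**3  # the ~9 GB floor mentioned in this issue

def can_load_large_model() -> bool:
    # Only allow another large model load if it keeps us above the floor.
    return psutil.virtual_memory().available > FREE_RAM_FLOOR_BYTES

print(can_load_large_model())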

Add alchemy features/forms

  • In addition to the previous worker's forms, deepdanbooru interrogation is now supported via horde_safety.
  • There is also some interest in returning the image "features" as extracted by a CLIP model. This already happens as part of interrogation in horde_safety, so it is effectively "free" in this respect.
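To illustrate what returning such features would mean, here is a generic CLIP feature-extraction snippet using the transformers library; this is not horde_safety's API, and the model name and image path are placeholders:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic illustration of CLIP image-feature extraction; horde_safety runs
# its own CLIP pipeline internally, this only shows the concept.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generation.png")  # placeholder path to a generated image
inputs = processor(images=image, return_tensors="pt")
features = model.get_image_features(**inputs)  # a (1, 512) embedding tensor
print(features.shape)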

Terminal UI

The previous worker had a well-received curses UI. A similar sort of UI is certainly possible and would ideally show all the information as before (see link), plus:

  • The state of each process
  • An option to see each individual process's log file

Some of the previously shown information, such as the Worker Total and Entire Horde figures, could/should be moved to a separate stats screen to accommodate the per-process details.
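As a rough illustration, a minimal curses sketch of such a per-process status screen; the process names and states are placeholders, not the worker's real data model:

import curses
import time

def draw(stdscr):
    curses.curs_set(0)
    stdscr.nodelay(True)  # make getch() non-blocking
    processes = [("inference_0", "INFERENCE"), ("inference_1", "WAITING"), ("safety_0", "IDLE")]
    while True:
        stdscr.erase()
        stdscr.addstr(0, 0, "horde-worker-reGen -- process overview (press q to quit)")
        for row, (name, state) in enumerate(processes, start=2):
            stdscr.addstr(row, 0, f"{name:<14} {state}")
        stdscr.refresh()
        if stdscr.getch() == ord("q"):
            return
        time.sleep(0.1)

if __name__ == "__main__":
    curses.wrapper(draw)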

Make the queued megapixelsteps (and associated variables) configurable somehow

Is it possible to make the worker more efficient (in terms of total throughput) by allowing more queued megapixelsteps, at the cost of a higher average time-to-return for jobs (the time from pop to submit, as measured by the API /workers/ endpoint)?

If so, the megapixelsteps behavior should be made adjustable in the bridge data config somehow.

This may end up as not-planned, depending on the potential (or lack) of performance benefits.
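If such a setting were exposed in bridgeData.yaml, reading it might look like this sketch; the key name max_queued_megapixelsteps and its default value are invented for illustration:

import yaml  # PyYAML

with open("bridgeData.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Hypothetical key; not a real bridgeData.yaml option today.
max_queued_mps = config.get("max_queued_megapixelsteps", 30)
print(f"Will queue up to {max_queued_mps} megapixelsteps of work.")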

Putting the files into a path that contains spaces crashes on startup

I cloned the project into "/mnt/Margarine/File Storage/Apps/horde-worker-reGen/", but when launching horde_bridge.sh I get a crash with this log:

Using jemalloc from /usr/lib/x86_64-linux-gnu
/tmp/mambafFGSqIcPNmU: line 2: /mnt/Margarine/File: No such file or directory
/tmp/mambafFGSqIcPNmU: line 3: micromamba: command not found
/tmp/mambafFGSqIcPNmU: line 5: exec: python: not found
download_models.py exited with error code. Aborting

I suspect there is a missing pair of quotes around a path variable in a script somewhere, causing the path to be cut at the space and thus not found. Would be great to get it fixed!
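The launcher itself is a shell script, but the suspected word-splitting hazard is easy to demonstrate from Python; this standalone sketch shows why an unquoted path containing a space breaks, and how quoting (or avoiding the shell entirely) fixes it:

import shlex
import subprocess
import sys

path = "/mnt/Margarine/File Storage/Apps/horde-worker-reGen"

print(f"cd {path}")               # broken: the shell would see two arguments
print(f"cd {shlex.quote(path)}")  # fixed: quoting keeps the path intact

# Passing an argument list (no shell involved) avoids the problem entirely.
subprocess.run([sys.executable, "-c", "import sys; print(sys.argv[1])", path])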

LoRa+TI downloads could be initiated ahead of inference

Currently, if a LoRa or TI is not on disk, the download is not started until the inference message is received and ComfyUI is entered. This could potentially be done earlier, perhaps as part of the preload response from an inference process.
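A rough sketch of what preload-time downloading could look like; download_lora, the job fields, and the pool size are illustrative names and values, not the worker's real API:

from concurrent.futures import Future, ThreadPoolExecutor

_download_pool = ThreadPoolExecutor(max_workers=2)

def download_lora(name: str) -> str:
    # Placeholder: fetch the file to disk and return the local path.
    return f"/models/loras/{name}.safetensors"

def on_preload(job: dict) -> list[Future]:
    # Kick off downloads as soon as the job is known; the inference step can
    # later wait on these futures instead of downloading synchronously.
    return [_download_pool.submit(download_lora, lora) for lora in job.get("loras", [])]

futures = on_preload({"loras": ["some-style-lora"]})
print([f.result() for f in futures])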

Support runtime config file selection

Please implement support for directing the worker to a configuration file other than bridgeData.yaml. This would allow multiple worker instances to be run from the same directory with different configurations, which is useful in multi-GPU cases to avoid duplicating the directory (and models) or working around it some other way.
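A minimal sketch of the requested behavior; the --bridge-data-file flag name is invented for illustration:

import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="horde-worker-reGen (sketch)")
parser.add_argument(
    "--bridge-data-file",
    type=Path,
    default=Path("bridgeData.yaml"),
    help="path to the bridge data config; lets several workers share one directory",
)
args = parser.parse_args()
print(f"Loading worker config from {args.bridge_data_file}")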

ERROR

critical libmamba Cannot activate, prefix does not exist at: 'C:\Users\yuuwa\OneDrive\Plocha\horde-worker-reGen-main\conda\envs\windows'

ERROR: Cannot install -r requirements.txt (line 5) because these package versions have conflicting dependencies.

The conflict is caused by:
hordelib 2.6.5 depends on mediapipe>=0.9.1.0
hordelib 2.6.4 depends on mediapipe>=0.9.1.0

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

I tried a lot of stuff but none of it helped me.

Single-model worker configs should mean a less aggressive memory cleanup scheme

The primary intent behind leaving a certain amount of system RAM free is to provide a cushion for other, potentially very large, models to load (such as SDXL models). However, when the worker is configured to run only a single model, the memory conditions become much more predictable, and the worker will fail anyway if an OOM occurs.

  • If the worker has one model only
    • If the model has only a single model file
      • Keep the model entirely on VRAM 100% of the time
    • If the model consists of multiple model files (as is the case with Stable Cascade)
      • Avoid offloading to disk if possible, swapping the models only between RAM and VRAM.

If failures are met in this situation, it's likely the memory-management overhead would only be encouraging the worker to run in very poor memory conditions (as it would constantly be loading off disk for little to no reason).
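A sketch of the decision logic described above; the policy names and function are illustrative, not the worker's actual code:

from enum import Enum, auto

class OffloadPolicy(Enum):
    PIN_TO_VRAM = auto()    # keep the model resident on the GPU
    RAM_VRAM_SWAP = auto()  # swap between system RAM and VRAM, never disk
    DEFAULT = auto()        # existing (aggressive) cleanup behavior

def pick_offload_policy(configured_models: list[str], component_files: int) -> OffloadPolicy:
    if len(configured_models) != 1:
        return OffloadPolicy.DEFAULT
    if component_files == 1:
        return OffloadPolicy.PIN_TO_VRAM
    # Multi-component models (e.g. Stable Cascade) stay out of disk offload.
    return OffloadPolicy.RAM_VRAM_SWAP

print(pick_offload_policy(["SDXL 1.0"], component_files=1))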

`n_iter` support

The reGen worker was written with this option in mind, and it presently (should) always assume that more than one image result is possible, but API and hordelib machinery may need to be added or adjusted to confirm this fully works as intended.

Allow workers to configure a 'pinned' model that will be preferred

I can see the merit in having a model such as SDXL configured to pop on its own queue, or at least pop preferentially (by omitting the other configured models from the pop while the pinned model doesn't have a job), so workers can offer the SD1.5 models while prioritizing SDXL (or whatever model they choose).
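A sketch of what such pinned-model pop behavior could look like; the function and its arguments are illustrative, not the worker's actual pop logic:

def models_for_next_pop(all_models: list[str], pinned_model: str, pinned_busy: bool) -> list[str]:
    # Offer only the pinned model while it is free, so it pops preferentially;
    # fall back to the full model list once the pinned model has a job.
    if not pinned_busy and pinned_model in all_models:
        return [pinned_model]
    return all_models

print(models_for_next_pop(["SDXL 1.0", "Deliberate"], "SDXL 1.0", pinned_busy=False))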

asyncio timeout on submit can put the worker into maintenance

Not quite sure what caused this stack trace, but it also doesn't appear in the trace logs.

2024-01-28 14:38:45.455 | ERROR    | asyncio.events:_run:80 - An error has been caught in function '_run', process 'MainProcess' (4053147), thread 'MainThread' (140531284598656):
Traceback (most recent call last):

  File "/home/db0/projects/horde-worker-reGen/run_worker.py", line 110, in <module>
    main(multiprocessing.get_context("spawn"))
    │    │               └ <bound method DefaultContext.get_context of <multiprocessing.context.DefaultContext object at 0x7fcffb50b910>>
    │    └ <module 'multiprocessing' from '/usr/lib/python3.10/multiprocessing/__init__.py'>
    └ <function main at 0x7fcffb678f70>

  File "/home/db0/projects/horde-worker-reGen/run_worker.py", line 71, in main
    start_working(
    └ <function start_working at 0x7fcff8f5d480>

  File "/home/db0/projects/horde-worker-reGen/horde_worker_regen/process_management/main_entry_point.py", line 22, in start_working
    process_manager.start()
    │               └ <function HordeWorkerProcessManager.start at 0x7fcff81b8430>
    └ <horde_worker_regen.process_management.process_manager.HordeWorkerProcessManager object at 0x7fcff7a822c0>

  File "/home/db0/projects/horde-worker-reGen/horde_worker_regen/process_management/process_manager.py", line 2468, in start
    asyncio.run(self._main_loop())
    │       │   │    └ <function HordeWorkerProcessManager._main_loop at 0x7fcff81b83a0>
    │       │   └ <horde_worker_regen.process_management.process_manager.HordeWorkerProcessManager object at 0x7fcff7a822c0>
    │       └ <function run at 0x7fcffb572950>
    └ <module 'asyncio' from '/usr/lib/python3.10/asyncio/__init__.py'>

  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
           │    │                  └ <coroutine object HordeWorkerProcessManager._main_loop at 0x7fcf0f1685f0>
           │    └ <function BaseEventLoop.run_until_complete at 0x7fcffb0ec3a0>
           └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
    │    └ <function BaseEventLoop.run_forever at 0x7fcffb0ec310>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7fcffb0ede10>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7fcffb0957e0>
    └ <Handle Task.task_wakeup(<Future cancelled>)>
> File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
    │    │            │    │           │    └ <member '_args' of 'Handle' objects>
    │    │            │    │           └ <Handle Task.task_wakeup(<Future cancelled>)>
    │    │            │    └ <member '_callback' of 'Handle' objects>
    │    │            └ <Handle Task.task_wakeup(<Future cancelled>)>
    │    └ <member '_context' of 'Handle' objects>
    └ <Handle Task.task_wakeup(<Future cancelled>)>

  File "/home/db0/projects/horde-worker-reGen/horde_worker_regen/process_management/process_manager.py", line 1676, in submit_single_generation
    async with self._aiohttp_session.put(
               │    │                └ <function ClientSession.put at 0x7fcff8c904c0>
               │    └ <aiohttp.client.ClientSession object at 0x7fcf0f1d14e0>
               └ <horde_worker_regen.process_management.process_manager.HordeWorkerProcessManager object at 0x7fcff7a822c0>

  File "/home/db0/projects/horde-worker-reGen/venv/lib/python3.10/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
    │    │             │    └ <member '_coro' of '_BaseRequestContextManager' objects>
    │    │             └ <aiohttp.client._RequestContextManager object at 0x7fcf00904250>
    │    └ <member '_resp' of '_BaseRequestContextManager' objects>
    └ <aiohttp.client._RequestContextManager object at 0x7fcf00904250>
  File "/home/db0/projects/horde-worker-reGen/venv/lib/python3.10/site-packages/aiohttp/client.py", line 467, in _request
    with timer:
         └ <aiohttp.helpers.TimerContext object at 0x7fcf00906530>
  File "/home/db0/projects/horde-worker-reGen/venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 721, in __exit__
    raise asyncio.TimeoutError from None
          │       └ <class 'asyncio.exceptions.TimeoutError'>
          └ <module 'asyncio' from '/usr/lib/python3.10/asyncio/__init__.py'>

asyncio.exceptions.TimeoutError
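One possible mitigation, sketched below, is to catch the timeout around the submit call and retry with backoff instead of letting it bubble up and crash the main loop; the URL and payload shape here are placeholders, not the worker's real submit call:

import asyncio
import aiohttp

async def submit_with_retry(session: aiohttp.ClientSession, payload: dict, attempts: int = 3) -> bool:
    for attempt in range(1, attempts + 1):
        try:
            async with session.put("https://example.invalid/generate/submit", json=payload) as response:
                return response.status == 200
        except asyncio.TimeoutError:
            # The API did not answer in time; back off and try again.
            await asyncio.sleep(2 * attempt)
    return False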

AMD GPU issues

When trying to use this with an AMD GPU, it doesn't work; it says that it is trying to search for an NVIDIA GPU.
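A quick way to check what the installed torch build actually supports (these are standard torch attributes; on an AMD card a ROCm build of torch is required, which reports a non-None torch.version.hip):

import torch

print("GPU available to torch:", torch.cuda.is_available())
print("CUDA runtime version:", torch.version.cuda)      # None on ROCm-only builds
print("ROCm (HIP) runtime version:", torch.version.hip)  # None on CUDA-only builds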

Post-processor crashing makes the worker stall

From the logs of cozmyc, I noticed some weird OOM errors from the post-processors even though there should be more than enough VRAM. I don't quite understand why, since I thought the post-processors run on the CPU and use system RAM.

(Cozmyc has a very old CPU, so this is probably relevant.)

2024-01-16 09:43:59.153 | ERROR    | hordelib.comfy_horde:send_sync:666 - execution_error, {'prompt_id': 'b0749e50-0fc3-423e-b157-d72a8511b395', 'node_id': 'face_restore_with_model', 'node_type': 'FaceRestoreWithModel', 'executed': ['model_loader', 'image_loader'], 'exception_message': 'Unable to allocate 384. MiB for an array with shape (4096, 4096, 3) and data type float64', 'exception_type': 'numpy.core._exceptions._ArrayMemoryError', 'traceback': ['  File "C:\\Users\\santiago\\AppData\\Local\\Programs\\Python\\Python310\\Lib\\site-packages\\hordelib\\_comfyui\\execution.py", line 154, in recursive_execute\n    output_data, output_ui = get_output_data(obj, input_data_all)\n', '  File "C:\\Users\\santiago\\AppData\\Local\\Programs\\Python\\Python310\\Lib\\site-packages\\hordelib\\_comfyui\\execution.py", line 84, in get_output_data\n    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)\n', '  File "C:\\Users\\santiago\\AppData\\Local\\Programs\\Python\\Python310\\Lib\\site-packages\\hordelib\\_comfyui\\execution.py", line 77, in map_node_over_list\n    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))\n', '  File "C:\\Users\\santiago\\AppData\\Local\\Programs\\Python\\Python310\\Lib\\site-packages\\hordelib\\nodes\\facerestore\\__init__.py", line 180, in restore_face\n    restored_img = face_helper.paste_faces_to_input_image()\n', '  File "C:\\Users\\santiago\\AppData\\Local\\Programs\\Python\\Python310\\Lib\\site-packages\\hordelib\\nodes\\facerestore\\facelib\\utils\\face_restoration_helper.py", line 527, in paste_faces_to_input_image\n    inv_soft_mask * pasted_face + (1 - inv_soft_mask) * upsample_img\n'],

We should look at our error handling in the post-processing process to make it fail more gracefully and inform the user.
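A sketch of that more graceful failure mode; the function names are illustrative, not the real hordelib entry points:

def run_post_processing_safely(post_process, image):
    try:
        return post_process(image)
    except MemoryError as err:
        # numpy's _ArrayMemoryError (seen in the log above) subclasses MemoryError.
        print(f"Post-processing failed ({err}); returning the unprocessed image.")
        return image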
