marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

License: Apache License 2.0

Python 99.72% Dockerfile 0.02% Shell 0.23% Lua 0.03%

deep-learning information-retrieval machinelearning vector-search tensor-search clip multi-modal search-engine transformers vision-language

marqo's Issues

[BUG] 'unavailable_shards_exception'

Describe the bug

When I was trying to index data, it was taking a long time (minutes) for only a couple of documents. Previously it had taken seconds. When a response was returned it had the following errors:

{'errors': True,
 'items': [{'_id': '0734ac47-5337-434f-8a20-e0e157e78e2b',
   'status': 503,
   'error': {'type': 'unavailable_shards_exception',
    'reason': '[my-multimodal-index][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[my-multimodal-index][0]] containing [3] requests]'}},
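When a response like the one above comes back, it can help to separate the failed items from the successful ones before retrying. The following is a minimal sketch, assuming the response is a dict with the `errors` and `items` keys shown above; the helper name `failed_items` and the sample data are hypothetical and not part of Marqo's API.

```python
from typing import Any, Dict, List


def failed_items(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the items of a bulk-style response that carry an 'error' entry."""
    if not response.get("errors"):
        return []
    return [item for item in response.get("items", []) if "error" in item]


# Sample shaped like the response in this report (values shortened, hypothetical).
sample = {
    "errors": True,
    "items": [
        {"_id": "a", "status": 503,
         "error": {"type": "unavailable_shards_exception",
                   "reason": "primary shard is not active"}},
        {"_id": "b", "status": 201},
    ],
}

for item in failed_items(sample):
    print(item["_id"], item["status"], item["error"]["type"])
```

Only the failed documents would then need to be resubmitted once the shard is active again.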

To Reproduce
Steps to reproduce the behavior:

Follow the instructions for running the simple http server to allow reading images (https://marqo.pages.dev/advanced_usage/)
Index some data with an image, e.g.

{'jpg_http': 'http://host.docker.internal:8222/dataset/iconic-images-and-descriptions/Fruit/Apple/Golden-Delicious/Golden-Delicious_Iconic.jpg',
  'txt_all': 'Golden Delicious has a white juicy pulp and a greenish yellow '
             'shell. The taste is mellow and sweet, making Golden Delicious '
             'suitable for desserts.\n'},

Expected behavior
The image would be indexed in < 1 second without an error

Desktop (please complete the following information):

OS: Ubuntu 20.04

[ENHANCEMENT] Add support for audio cross-modal search

Is your feature request related to a problem? Please describe.
Currently only text and images are supported. Audio should also be supported for cross-modal search.

Describe the solution you'd like
Add in support for cross-modal search using audio

Additional context
https://github.com/AndreyGuzhov/AudioCLIP
https://github.com/descriptinc/lyrebird-wav2clip
https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html#

[BUG] not receiving the desired results

Describe the bug
I took the code from the README and changed the first dict in the list to:

    {
        "Title": "Indian culture",
        "Description": '''What, one wonders, is the lowest common denominator of Indian culture today? The attractive Hema Malini? The songs of Vinidh Barati? The attractive Hema Malini? The sons of Vinidh Barati?
 Or the mouth-watering Masala Dosa? Delectable as these may be, each yield pride of place to that false (?) symbol of a new era-the synthetic fibre. In less than twenty years the nylon sari and the terylene shirt have swept the countryside, penetrated to the farthest corners of the land and persuaded every common man, woman and child that the key to success in the present-day world lie in artificial fibers: glass nylon, crepe nylon, tery mixes, polyesters and what have you. More than the bicycles, the wristwatch or the transistor radio, synthetic clothes have come to represent the first step away form the village square. The village lass treasures the flashy nylon sari in her trousseau most delay; the village youth gets a great kick out of his cheap terrycot shirt and trousers, the nearest he can approximate to the expensive synthetic sported by his wealthy citybred contemporaries. And the Neo-rich craze for ‘phoren’ is nowhere more apparent than in the price that people will pay for smuggled, stolen, begged borrowed second hand or thrown away synthetics. Alas, even the uniformity of nylon. 

'''
    }

The question I ask is: q="The latest symbol of modernity for the rural people is?"

When I then switch back to the same question as in the README and run the script with the python command, I still get the output for the previous question.

The expected output is the same as in the README.

The system is an M1 Mac, and I am running Marqo with Docker.

[ENHANCEMENT] read pdfs, txt, csv files from pointers

Is your feature request related to a problem? Please describe.
Yes - it could be good to support reading csv, txt, pdf files from pointers (not scanned though). It would read the text directly (no ocr).

Describe the solution you'd like
Have a reader in the same way we do for images, so that a pointer to a file means the file can be read.

Describe alternatives you've considered
Alternatives are that the user does this processing before Marqo. This will always be an option but for less complex use cases it would be very convenient.

[BUG] Setup Marqo with Docker Compose

Describe the bug
When testing a basic example with Marqo on a project set up with Docker Compose, the requests module fails with the following error:
requests.exceptions.InvalidSchema: No connection adapters were found for '"http://admin:admin@opensearch-node1:9200"/marqo-simplewiki-demo-all'

To Reproduce

Use the following Docker Compose file to set up the infra.

version: "3.7"
services:

  # Official Marqo Build
  marqo-rt:
    image: marqoai/marqo:0.0.4
    container_name: marqo-rt
    environment:
      - OPENSEARCH_URL="http://opensearch-node1:9200"
    ports:
      - 8882:8882
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - opensearch-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  opensearch-node1:
    image: opensearchproject/opensearch:2.3.0
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # disables execution of install_demo_configuration.sh bundled with security plugin, which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # disables security plugin entirely in OpenSearch by setting plugins.security.disabled: true in opensearch.yml
      - "discovery.type=single-node" # disables bootstrap checks that are enabled when network.host is set to a non-loopback address
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.3.0
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      - 'OPENSEARCH_HOSTS=["http://opensearch-node1:9200"]'
      - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true" # disables security dashboards plugin in OpenSearch Dashboards
    networks:
      - opensearch-net

volumes:
  opensearch-data1:

networks:
  opensearch-net:

Try to run the simplewiki example against that instance.

Expected behavior
Example should work as advertised.

Desktop (please complete the following information):

OS: Ubuntu 22.04 with NVidia GeForce 2070 Super (properly recognized by nvidia-smi)

Additional context
I'm aware the OpenSearch setup is different (I'm using the default from their website) but it's not significantly different from the one provided by Marqo. The failure I see is within the Marqo container, before any API call makes its way to opensearch.
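The doubled quotes in the error message suggest that the value of OPENSEARCH_URL reached the container with its quote characters included; in the list form of `environment:`, Compose generally passes everything after `=` through literally. The sketch below assumes this reading is correct and uses only the standard library to show why such a value has no recognizable URL scheme. The `strip_wrapping_quotes` helper is hypothetical and is not Marqo's code.

```python
from urllib.parse import urlparse


def strip_wrapping_quotes(value: str) -> str:
    """Remove one pair of matching quotes wrapped around a value, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


raw = '"http://opensearch-node1:9200"'   # value as the container would see it
clean = strip_wrapping_quotes(raw)

# With the quotes in place no scheme is detected, so no HTTP adapter can match.
print(repr(urlparse(raw + "/my-index").scheme))
# Without them the URL parses normally.
print(repr(urlparse(clean + "/my-index").scheme))
```

If this is the cause, writing the entry without inner quotes (`- OPENSEARCH_URL=http://opensearch-node1:9200`) would likely avoid the problem.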

[ENHANCEMENT] Search Operation Should Return Multiple Highlights.

Is your feature request related to a problem? Please describe.
Currently, a search operation only returns one highlight for each indexed document.

Describe the solution you'd like
Get the option to specify the number of highlights to be returned for each indexed document.

Describe alternatives you've considered
None

Additional context
I am creating a podcast-demo-code, wherein I index two documents, and each document has the name of the podcast, a short description, and the full transcript.
So whenever I perform a search operation, it just returns one highlight over the whole transcript, I think it will be good if there is an option to return multiple highlights.

PUT Documents - delete fields [ENHANCEMENT]

Is your feature request related to a problem? Please describe.
There is no way to delete a field in an existing doc using the PUT /documents call.

Describe the solution you'd like
A way to delete a field in an existing doc using the PUT /documents endpoint.

Describe alternatives you've considered
The best way is to use the POST /documents endpoint. But this can be expensive.

[ENHANCEMENT] support timm models for images

Is your feature request related to a problem? Please describe.
At the moment, only CLIP models are supported. These are good models and work across language and text. However, there are lots of other models in timm that are SOTA in classification and can still provide good embedding. They also span a large range of sizes and architectures so offer good accuracy/latency trade-offs.

Describe the solution you'd like
A new class of timm models can be specified for the "model" field at index creation time.

Describe alternatives you've considered
None

Additional context
https://github.com/rwightman/pytorch-image-models

[ENHANCEMENT] Add support for videos

Is your feature request related to a problem? Please describe.
Currently only text and images are supported; videos are not yet supported natively (they can be indexed only if input frame by frame).

Describe the solution you'd like
Video files can be processed directly without needing to split into frames.

Additional context
https://github.com/facebookresearch/fairseq/tree/main/examples/MMPT
https://github.com/microsoft/VideoX/tree/master/X-CLIP

[ENHANCEMENT] Run Marqo on Google colab

Is your feature request related to a problem? Please describe.
A demonstration of Marqo running on google colab

Describe the solution you'd like
Marqo running in colab

Additional context
A good start might be to reproduce the readme examples in google colab

[ENHANCEMENT] - Get Indices

Is your feature request related to a problem? Please describe.
I have no way of querying which indices have been loaded to OpenSearch to perform operations on them.

Describe the solution you'd like
A function on the Client object to return a list of indices.

[ENHANCEMENT] Guide for running Marqo on Azure

Is your feature request related to a problem? Please describe.
We should have a guide for users to run Marqo on Azure, similar to the AWS guide that we are currently adding.

Describe the solution you'd like
Any user should be able to read the guide and follow best practices to set up Marqo on Azure

Describe alternatives you've considered
none

Additional context
none

[ENHANCEMENT] Guide to running Marqo on SageMaker notebooks instances/within a notebook

Is your feature request related to a problem? Please describe.
A guide to running Marqo on SageMaker notebook instances and from within a notebook

Describe the solution you'd like
A quick tutorial

[ENHANCEMENT] read .svg files

Is your feature request related to a problem? Please describe.
Yes - only images (png, bmp, jpg etc) that are natively supported by PIL can be read.

Describe the solution you'd like
svg files can be read directly for indexing or searching

[ENHANCEMENT] Allow returning tensors results based on average rather than maximum

Is your feature request related to a problem? Please describe.
Allow users to use average rather than maximum when searching vectors

Describe the solution you'd like
When a user creates an index, they specify whether they would like it structured as an average or as a maximum (or both). If the user chooses average, we compute all the vectors, average them and store it.
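To make the difference concrete, here is a small sketch in plain Python. It assumes each document is split into chunks with one vector per chunk, and it compares scoring by the best-matching chunk with scoring against a single averaged vector. The function names and toy vectors are hypothetical and do not reflect Marqo's internals.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(chunks: List[Vector]) -> List[float]:
    """Element-wise mean of the chunk vectors (what an 'average' index would store)."""
    n = len(chunks)
    return [sum(col) / n for col in zip(*chunks)]


def max_score(query: Vector, chunks: List[Vector]) -> float:
    """Score a document by its single best-matching chunk."""
    return max(cosine(query, c) for c in chunks)


def average_score(query: Vector, chunks: List[Vector]) -> float:
    """Score a document against the mean of its chunk vectors."""
    return cosine(query, average_vector(chunks))


chunks = [[1.0, 0.0], [0.0, 1.0]]   # two toy chunk embeddings
query = [1.0, 0.0]
print(max_score(query, chunks), average_score(query, chunks))
```

Maximum scoring rewards a document with one strongly matching chunk, while the averaged vector reflects the document as a whole and needs only one stored vector per field.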

Docker not installing

I am having trouble installing Docker on my machine; it tells me that I need Windows 10 Pro or later.

Incorrect default argument in "search" function

Describe the bug

To Reproduce
Steps to reproduce the behavior:
call the search method and don't specify the "search_method"

Expected behavior
Should have displayed the search results

Screenshots
marqo.errors.MarqoWebError: MarqoWebError: MarqoWebError Error message: {"detail":[{"loc":["body","searchMethod"],"msg":"NEURAL is not a valid SearchMethod","type":"value_error"}]} status_code: 422, type: unhandled_error_type, code: unhandled_error, link:

Desktop (please complete the following information):

OS: macOS Monterey v12.5

[BUG] Request Entity Too Large

Describe the bug
I am using the image-to-image search mode. During the "add_documents" process, Marqo raises an error that says "HTTPError: 413 Client Error: Request Entity Too Large for url: http://localhost:8882/indexes/duplicate-detection-index/documents?refresh=true&device=cpu". I use 10,000 images to build the index. Is there a limit on the number of images, or did something else cause the problem? What can I do to fix this issue? Thanks.

Expected behavior
Use the provided 10,000 images to build the index, and find similar images for the query images.
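A 413 response means the request body exceeded a size limit on the server or on a proxy in front of it, so one possible workaround is to send the documents in several smaller requests. The sketch below shows a generic batching helper; `batched`, the batch size of 50, and the image URLs are illustrative choices, and the commented-out call assumes a `client` object set up as in the README.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter batch
        yield batch


# Hypothetical documents; the URLs are placeholders.
documents = [
    {"_id": str(i), "image": f"http://host.docker.internal:8222/{i}.jpg"}
    for i in range(120)
]

for batch in batched(documents, 50):
    # client.index("duplicate-detection-index").add_documents(batch)  # assumed client
    print(len(batch))
```

Another report in this list passes `batch_size=50` to `add_documents`, which may achieve the same effect if the installed client version supports it.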

Screenshots

Desktop (please complete the following information):

OS: [e.g. Linux]

[BUG] tox failing to find PYTHONPATH

Describe the bug
Trying to run tox from marqo results in a key error, cannot find PYTHONPATH

py38 create: /home/jesse/code/s2search/marqo/.tox/py38
___________________________________________________________________________ summary ____________________________________________________________________________
  py38: commands succeeded
  congratulations :)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 354, in get
    return self.resolved[name]
KeyError: 'PYTHONPATH'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/tox", line 11, in <module>
    load_entry_point('tox==3.13.2', 'console_scripts', 'tox')()
  File "/usr/lib/python3/dist-packages/tox/session/__init__.py", line 44, in cmdline
    main(args)
  File "/usr/lib/python3/dist-packages/tox/session/__init__.py", line 68, in main
    exit_code = session.runcommand()
  File "/usr/lib/python3/dist-packages/tox/session/__init__.py", line 192, in runcommand
    return self.subcommand_test()
  File "/usr/lib/python3/dist-packages/tox/session/__init__.py", line 220, in subcommand_test
    run_sequential(self.config, self.venv_dict)
  File "/usr/lib/python3/dist-packages/tox/session/commands/run/sequential.py", line 9, in run_sequential
    if venv.setupenv():
  File "/usr/lib/python3/dist-packages/tox/venv.py", line 594, in setupenv
    status = self.update(action=action)
  File "/usr/lib/python3/dist-packages/tox/venv.py", line 252, in update
    self.hook.tox_testenv_create(action=action, venv=self)
  File "/usr/lib/python3/dist-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/lib/python3/dist-packages/pluggy/manager.py", line 92, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/lib/python3/dist-packages/pluggy/manager.py", line 83, in <lambda>
    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
  File "/usr/lib/python3/dist-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/lib/python3/dist-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/lib/python3/dist-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/lib/python3/dist-packages/tox/venv.py", line 682, in tox_testenv_create
    venv._pcall(
  File "/usr/lib/python3/dist-packages/tox/venv.py", line 553, in _pcall
    env = self._get_os_environ(is_test_command=is_test_command)
  File "/usr/lib/python3/dist-packages/tox/venv.py", line 472, in _get_os_environ
    env.update(self.envconfig.setenv)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 370, in __getitem__
    x = self.get(name, self._DUMMY)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 364, in get
    self.resolved[name] = res = self.reader._replace(val)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 1516, in _replace
    replaced = Replacer(self, crossonly=crossonly).do_replace(value)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 1552, in do_replace
    expanded = substitute_once(value)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 1550, in substitute_once
    return self.RE_ITEM_REF.sub(self._replace_match, x)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 1597, in _replace_match
    return self._replace_substitution(match)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 1632, in _replace_substitution
    val = self._substitute_from_other_section(sub_key)
  File "/usr/lib/python3/dist-packages/tox/config/__init__.py", line 1626, in _substitute_from_other_section
    raise tox.exception.ConfigError("substitution key {!r} not found".format(key))
tox.exception.ConfigError: ConfigError: substitution key '/' not found

To Reproduce

cd into marqo
run tox

Expected behavior
tox runs

Desktop (please complete the following information):

OS: Ubuntu 20.04

[ENHANCEMENT] better summary of devices and models running at Marqo startup

Is your feature request related to a problem? Please describe.
When Marqo starts up, it checks the available devices and outputs a summary. It also loads some models and tests them on the devices. It is a bit hard to parse from the logs whether something was successful. An improvement would be a better summary of the devices and of whether the models ran successfully on them.

Describe the solution you'd like
On start up, the results from the devices and the models are displayed in a single table.

Describe alternatives you've considered
Nothing really

Additional context
Here is the place https://github.com/marqo-ai/marqo/blob/mainline/src/marqo/tensor_search/on_start_script.py#L88-L112

kalkulator.py

Pagination [ENHANCEMENT]

Is your feature request related to a problem? Please describe.
It's inconvenient to have only a single page of results for search queries. For certain use cases, it would be great for end users to scroll through pages of results.

Describe the solution you'd like
Implement an offset parameter to search functions. Limit and offset parameters can then enable scrolling through result pages. offset and limit can be mapped to the backend opensearch pagination parameters from and size.
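The mapping described above is small enough to sketch directly. OpenSearch search bodies accept `from` and `size` for pagination; the function names below are hypothetical, and how Marqo would actually wire this in is an assumption.

```python
from typing import Dict


def to_opensearch_paging(limit: int = 10, offset: int = 0) -> Dict[str, int]:
    """Translate limit/offset into the `size`/`from` keys of an OpenSearch search body."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return {"from": offset, "size": limit}


def page_to_offset(page: int, limit: int) -> int:
    """Offset of the first hit on a 0-indexed results page."""
    return page * limit


# Third page (index 2) with ten results per page.
print(to_opensearch_paging(limit=10, offset=page_to_offset(2, 10)))
```

A client-side "next page" control then only needs to increment the page index and recompute the offset.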

Describe alternatives you've considered
Returning a large number of documents. Then manually implementing a client-side scroll feature. This is a lot of work, and it means larger memory overhead for clients.

[BUG] UnicodeDecodeError while reading simplewiki.json

Describe the bug
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 34260: character maps to <undefined>
Python's json is unable to read simplewiki dataset.
Python Version: 3.9.13

Console Output

(venv) D:\Codes\marqo-wiki\src>python simple_wiki_demo.py
Traceback (most recent call last):
 File "D:\Codes\marqo-wiki\src\simple_wiki_demo.py", line 34, in <module>
  data = read_json(dataset_file)
 File "D:\Codes\marqo-wiki\src\simple_wiki_demo.py", line 15, in read_json
  data = json.load(f)
 File "C:\Users\anubh\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
  return loads(fp.read(),
 File "C:\Users\anubh\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
  return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 34260: character maps to <undefined>

To Reproduce
Steps to reproduce the behavior:

Go to simplewiki example
Run simple_wiki_demo.py using python simple_wiki_demo.py
See error

Expected behavior
json.load is supposed to load the data without any errors.

Desktop (please complete the following information):

OS: Windows 10 Home

Additional context
Working fix: change encoding type while reading file in read_json function. Replace line 13 of this demo script --> with open(filename, 'r', encoding='utf-8') as f:
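The fix above can be sketched as a complete function. On Windows, `open()` without an `encoding` argument falls back to the locale's code page (cp1252 in the traceback), which is why the UTF-8 file fails to decode. The temporary file and sample data below are hypothetical and only make the example self-contained.

```python
import json
import os
import tempfile


def read_json(filename: str):
    """Load a JSON file as UTF-8 regardless of the platform's default encoding."""
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)


# Round-trip check with a character whose UTF-8 encoding contains byte 0x81.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"title": "Álvaro"}, f, ensure_ascii=False)
    data = read_json(path)

# ascii() keeps the output printable on consoles with a limited encoding.
print(ascii(data["title"]))
```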

Tensor prefiltering not working for fields with spaces [BUG]

Describe the bug
Tensor prefiltering not working for fields with spaces. When a filter is applied to a field with a space, no documents are retrieved.

To Reproduce
Steps to reproduce the behavior:

Have Marqo running
Index some documents. One should have a field with a space in it:

curl -XPOST  'http://localhost:8882/indexes/my-irst-ix/documents?refresh=true&device=cpu' -H 'Content-type:application/json' -d '
[ 
    {
        "Title": "Honey is a delectable food stuff", 
        "Desc" : "some boring description",
        "_id": "honey_facts_119",
        "gapped field": "wololo"
    }, {
        "Title": "Space exploration",
        "Desc": "mooooon! Space!!!!",
        "_id": "moon_fact_145"
    }
]'

Doing a filtered lexical search works (document id: "honey_facts_119" is retrieved):

curl -XPOST  'http://localhost:8882/indexes/my-irst-ix/search?device=cpu' -H 'Content-type:application/json' -d '{
    "q": "what do bears eat?",
    "searchMethod": "LEXICAL",
    "filter": "gapped\\ field:wololo"
}'

But doing a filtered tensor search doesn't (document id: "honey_facts_119" isn't retrieved):

curl -XPOST  'http://localhost:8882/indexes/my-irst-ix/search?device=cpu' -H 'Content-type:application/json' -d '{
    "q": "what do bears eat?",
    "searchMethod": "TENSOR",
    "filter": "gapped\\ field:wololo"
}'

Expected behavior
When a filter is applied to a field with a space, the documents should be retrieved, in the same way it is for lexical search
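For reference, the escaping used by hand in the requests above can be produced programmatically. This sketch only reproduces the filter string from the curl bodies; it does not address the tensor search behavior itself, and the helper names are hypothetical.

```python
import json


def escape_filter_field(field_name: str) -> str:
    """Backslash-escape spaces in a field name, as done by hand in the requests above."""
    return field_name.replace(" ", "\\ ")


def build_filter(field_name: str, value: str) -> str:
    """Build a `field:value` filter string with the field name escaped."""
    return f"{escape_filter_field(field_name)}:{value}"


body = {
    "q": "what do bears eat?",
    "searchMethod": "TENSOR",
    "filter": build_filter("gapped field", "wololo"),
}

# json.dumps doubles the backslash, matching the "gapped\\ field:wololo" in the curl body.
print(json.dumps(body))
```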

Desktop (please complete the following information):

OS: Ubuntu amd64
Version 0.0.3

[BUG] highlights return types are different for different search methods

Describe the bug
The results highlights field has a different return type for the different search methods. LEXICAL returns an empty list while TENSOR returns a dict.
They are accessed via results['hits'][0]['_highlights'].

To Reproduce
Steps to reproduce the behavior:

Install marqo per readme

import marqo as mq
client = mq.Client()
client.index("my-first-index").add_documents([{'text':'something'}])
client.index("my-first-index").search('something') # _highlights type is a dict
client.index("my-first-index").search('something', search_method='LEXICAL') # _highlights type is a list

Expected behavior
The return types are the same.
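Until the return types agree, client code can normalize the field itself. The following is a minimal sketch, assuming hits shaped like the ones described above; `normalize_highlights` and the sample hits are hypothetical and not part of the Marqo client.

```python
from typing import Any, Dict


def normalize_highlights(hit: Dict[str, Any]) -> Dict[str, Any]:
    """Return `_highlights` as a dict, treating a missing value or empty list as {}."""
    highlights = hit.get("_highlights")
    if isinstance(highlights, dict):
        return highlights
    if not highlights:  # None or []
        return {}
    raise TypeError(f"unexpected _highlights type: {type(highlights).__name__}")


tensor_hit = {"_id": "1", "_highlights": {"text": "something"}}
lexical_hit = {"_id": "1", "_highlights": []}
print(normalize_highlights(tensor_hit), normalize_highlights(lexical_hit))
```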

Desktop (please complete the following information):

OS: Ubuntu 20.04

Run OpenSearch for Marqo on a port other than 9200

I tried to run docker run -p 9000:9000 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:2.1.0. Note that this port binding differs from the 9200:9200 binding found in the README.md.
I’m running Elasticsearch service in the background for an unrelated task. This is also running at port 9200 so I wanted to run OpenSearch for Marqo on a different port.

Slow Inference on arm64 machines

Describe the bug:
Marqo takes a long time to index and search.

To Reproduce:
On an arm64 machine, use Docker to run Marqo, then try to index and search.

Expected behavior:
Operations should be fast.

Desktop (please complete the following information):
OS: macOS Monterey v12.5
Machine: M1 MacBookPro, 2020

[BUG] `PIL.UnidentifiedImageError: cannot identify image file` during `index.add_documents(...)`

Describe the bug
When adding image files from a URL, I received the above error. To mitigate the issue, which I believe could have been caused by rate limiting on my end, I pre-downloaded all images and served them as in the apparel demo, i.e. via python3 -m http.server 8222. This still fails with the same error. As a workaround, I am inserting documents one at a time, as shown below:

for data_doc in tqdm(data):
    try:
        responses = mq.index(index_name).add_documents([data_doc], device=device)
        # print(f"<SUCCESS>\nAdded prompt:\n{data_doc['prompt']}\nURI: {data_doc['raw_discord_data_image_uri']}\n")
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f"<FAILURE>\nSkipping prompt:\n{data_doc['prompt']}\nURI: {data_doc['raw_discord_data_image_uri']}\n")

Inserting one at a time does not fail for either the direct URL or the pre-downloaded file. I will keep pre-downloading for now. For testing, the above is fine, but for my real workload of 10M data points this is a blocker.

To Reproduce
Steps to reproduce the behavior:

marqo_settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        # "image_preprocessing": {
        #     "patch_method": "frcnn"
        # },
        "model":"ViT-B/16",
        "normalize_embeddings":True,
    },
}

Using settings where I include a model for dense retrieval and a URI which may look like this url: https://cdn.discordapp.com/attachments/1005627160410722305/1006718276879003738/rick_and_morty_as_the_thing_fused_with_lovecraft_high_details_intricate_details_renaissance_style_painting_by_vincent_di_fate_artgerm_julie_bel_-H_768_-n_9_-i_-S_687487568_ts-1660090738_idx-3.png
or this internal path: http://host.docker.internal:8222/./artifacts/sample_prompts:v1/sample_prompts/rick_and_morty_as_the_thing_fused_with_lovecraft_high_details_intricate_details_renaissance_style_painting_by_vincent_di_fate_artgerm_julie_bel_-H_768_-n_9_-i_-S_687487568_ts-1660090738_idx-3.png
I get the above behavior consistently.

This behavior comes from /app/src/marqo/s2_inference/clip_utils.py which errs at Image.open(requests.get(image, stream=True).raw)

Expected behavior
Instead of erring the whole insert, I would like an option to ignore errors and be told via logs, stdout, and a response that problematic data points were not inserted at the very least.
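One way to approximate the requested behavior on the client side is to index in batches and fall back to single documents only when a batch fails, collecting the failures instead of aborting. This is a sketch under assumptions: `add_with_fallback` and `fake_add` are hypothetical, and `add_batch` stands in for a call such as `mq.index(index_name).add_documents`.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Doc = Dict[str, Any]


def add_with_fallback(
    docs: Sequence[Doc],
    add_batch: Callable[[List[Doc]], Any],
    batch_size: int = 50,
) -> Tuple[int, List[Tuple[Doc, str]]]:
    """Index docs in batches; if a batch fails, retry its docs one at a time.

    Returns the number of docs indexed and a list of (doc, error message) pairs.
    """
    indexed = 0
    failures: List[Tuple[Doc, str]] = []
    for start in range(0, len(docs), batch_size):
        batch = list(docs[start:start + batch_size])
        try:
            add_batch(batch)
            indexed += len(batch)
            continue
        except Exception:
            pass  # fall through and isolate the offending documents
        for doc in batch:
            try:
                add_batch([doc])
                indexed += 1
            except Exception as err:
                failures.append((doc, str(err)))
    return indexed, failures


# Stand-in for the real indexing call: rejects any batch containing a flagged doc.
def fake_add(batch: List[Doc]) -> None:
    if any(doc.get("bad") for doc in batch):
        raise ValueError("cannot identify image file")


docs = [{"_id": "1"}, {"_id": "2", "bad": True}, {"_id": "3"}]
indexed, failures = add_with_fallback(docs, fake_add, batch_size=2)
print(indexed, [(d["_id"], msg) for d, msg in failures])
```

Compared with inserting every document individually, this keeps batch throughput for the common case and pays the one-at-a-time cost only for batches that contain a problem.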

Desktop (please complete the following information):
Using Google's Vertex Workbench

OS: Debian 10
Environment: Python 3 configured for CUDA 11.0 and Intel MKL
Machine Type: n1-standard-4 (4 vCPUs, 15GB RAM)

[BUG] RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Describe the bug
Sometimes during indexing with a GPU the following error can arise

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

To Reproduce
Steps to reproduce the behavior:

client.index(index_name).add_documents(documents, batch_size=50, device='cuda', processes=4)

Expected behavior
It should index all the documents

Desktop (please complete the following information):

OS: ubuntu 20.04
RTX 3090

Tensor is not a valid Search Method

When using the example in the README to test, I get an error stating that tensor is not a valid search method.

To Reproduce

After setting up the docker requirements I created a file
Copied the data from the read me to use in the file
Run it in my terminal
See error in terminal

Expected behavior
I expected it to show a similar result like what was in

Screenshots

Desktop

OS: Windows 11
Chrome

[ENHANCEMENT] Remove the tokenisers parallelism

Is your feature request related to a problem? Please describe.
Batching with multiple processes for a huggingface based model causes the tokenizer to default to non-multi processing to avoid deadlocks. Due to the nature of the inference that is done, it would be ok to either turn off the parallelism in the tokenizer or switch to the python based one. There is no degradation in performance, just a constant warning message.

Describe the solution you'd like
Set an environment variable to turn off the parallelism in the hf tokenizer or default to the python based one.

Additional context
For the env var

TOKENIZERS_PARALLELISM=false

putting that somewhere in the startup script would probably work https://github.com/marqo-ai/marqo/blob/mainline/run_marqo.sh
alternatively the python based tokenizer could be called for all hf based models
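If the variable is set from Python rather than from the shell script, a sketch could look like the following. `TOKENIZERS_PARALLELISM` is the variable named above; where exactly this would live in Marqo's startup code is an assumption.

```python
import os

# Set this before any Hugging Face tokenizer is created in the process;
# the top of the entry point is the simplest place to guarantee that ordering.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Worker processes started afterwards inherit the environment, so the setting also applies when batching with multiple processes.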

[ENHANCEMENT] Guide for running Marqo on GCP

Is your feature request related to a problem? Please describe.
We should have a guide for users to run Marqo on GCP, similar to the AWS guide that we are currently adding.

Describe the solution you'd like
Any user should be able to read the guide and follow best practices to set up Marqo on GCP

Describe alternatives you've considered
none

Additional context
none

onnx requires cmake to be installed

I tried to install and test Marqo locally, but the onnx package could not be installed because it requires cmake, like in that issue.

This is easily fixed by running pip install cmake, but it might be a bit confusing.

Should we mention it in the README or add cmake as a required package?

[BUG] Can't use any tools with marqo

After installing Marqo and its libraries, I can't run any code that uses Marqo, not even the simple examples from the README.

Error:
ConnectionRefusedError: [Errno 61] Connection refused
HTTPConnection object at 0x7f7860bce220>: Failed to establish a new connection: [Errno 61] Connection refused
in send_request
raise BackendCommunicationError(str(err)

Desktop

OS (macOS 12.0.1)

Show number of chunks and documents an index has in the stats endpoint [ENHANCEMENT]

Is your feature request related to a problem? Please describe.
I would like to know the sum of the chunks and documents found in an index.

Describe the solution you'd like
When calling the stats endpoint for an index, I'd like to be shown the sum of the chunks and documents found in that index.

Describe alternatives you've considered
There is no real way to do this currently.

[BUG] How to install to develop locally or access some of the sub-libraries

Describe the bug
Marqo is used through docker. This means for some development or testing it can take longer than desired.

Expected behavior
An easy way to develop locally to test the sub-libraries

Desktop (please complete the following information):

OS: Ubuntu 20.04

Sentence-transformers requires rust compiler

When trying to install Marqo, sentence-transformers requires the Rust compiler to be installed on the system.

[ENHANCEMENT] Patch items

Is your feature request related to a problem? Please describe.
We should be able to update a single field (for example some metadata) without having to reindex all the document data.

Describe the solution you'd like
We should have a patch operation for documents: if the user provides only some of the fields, those fields are added to or updated on the document with the provided ID.
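
The requested semantics can be sketched with an in-memory stand-in. The `patch_document` helper and the dict-based `store` below are hypothetical and are not part of Marqo's API; the sketch only illustrates that fields named in the patch are added or overwritten while every other field of the document is left as it is.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-in for an index: document ID -> fields.
Store = Dict[str, Dict[str, Any]]


def patch_document(store: Store, doc_id: str, fields: Dict[str, Any]) -> Dict[str, Any]:
    """Add or overwrite only the given fields of an existing document."""
    if doc_id not in store:
        raise KeyError(f"no document with ID {doc_id!r}")
    store[doc_id].update(fields)  # fields not named in the patch stay untouched
    return store[doc_id]


# Example: change one metadata field and add another; the title is not touched.
store: Store = {"doc1": {"title": "Golden Delicious", "in_stock": True}}
patch_document(store, "doc1", {"in_stock": False, "price": 2.5})
```

In a real engine the point of such an operation would be that fields which are not part of the patch do not have to be re-embedded.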

Describe alternatives you've considered
We could just rely on updates, but in some use cases users need to update something like 10k embeddings when only a metadata field changes, so redoing all the transformations is a big waste of compute.

Additional context
Add any other context or screenshots about the feature request here.

[ENHANCEMENT] End-to-end demos reproduced in Jupyter notebooks

Is your feature request related to a problem? Please describe.
The end-to-end examples are Python scripts; it would be good to also have notebook versions. In particular, this would make it easier to display images in the examples.

Describe the solution you'd like
Jupyter notebook versions of the demos

Issue installing marqo dependencies on M1 Mac

I have been trying to install marqo's dependencies on my machine, but I keep getting the following error:

ERROR: Could not find a version that satisfies the requirement onnxruntime-gpu (from marqo-engine) (from versions: none)
ERROR: No matching distribution found for onnxruntime-gpu

I have also attached a screenshot of the error I am getting.

remote cluster in marqo config

Due to OS X security settings, localhost doesn't work for me, so I'm using 127.0.0.1, but marqo treats it as a remote cluster.

code from config.py

lowered_url = url.lower()
if "localhost" in lowered_url or "0.0.0.0" in lowered_url:
    urllib3.disable_warnings()
    self.cluster_is_remote = False

We should consider including 127.0.0.1

local_url_list = ["localhost", "0.0.0.0", "127.0.0.1"]
if [local_url for local_url in local_url_list if local_url in lowered_url]:
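
Substring matching can also misclassify a remote host whose name merely contains `localhost`. One possible alternative, sketched below, compares the parsed hostname against a set of local names. The function name `is_local_url` and the `LOCAL_HOSTS` set are illustrative and are not taken from Marqo's config.py; the sketch also assumes that the URL includes a scheme such as `http://`.

```python
from urllib.parse import urlparse

# Hostnames treated as local; illustrative, extend as needed.
LOCAL_HOSTS = {"localhost", "0.0.0.0", "127.0.0.1"}


def is_local_url(url: str) -> bool:
    """Return True when the URL's hostname is one of the known local names."""
    # urlparse lowercases the hostname; it is None when the URL has no scheme.
    hostname = urlparse(url).hostname
    return hostname in LOCAL_HOSTS
```

With this check, `https://127.0.0.1:9200` counts as local, while a host such as `notlocalhost.example.com` does not.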

First search is always 5 seconds longer than the following ones

I'm not sure whether this is fixable, but the processingTimeMs of the first search after initialising the client is always 5.2+ seconds, while all following searches take around 100 ms, whether I query the same index or a different one.
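
To separate a one-off warm-up cost from steady-state latency, the first call can be timed apart from the rest. The sketch below is generic: `search` is any zero-argument callable standing in for a real query, and `time_first_vs_rest` is a hypothetical helper, not part of the Marqo client. It measures wall-clock time on the client side, which is not the same thing as the processingTimeMs reported by the server.

```python
import time
from statistics import mean
from typing import Callable, Tuple


def time_first_vs_rest(search: Callable[[], object], repeats: int = 5) -> Tuple[float, float]:
    """Return (seconds for the first call, mean seconds of the following calls)."""

    def timed() -> float:
        start = time.perf_counter()
        search()
        return time.perf_counter() - start

    first = timed()
    rest = [timed() for _ in range(repeats)]  # repeats must be at least 1
    return first, mean(rest)
```

A large gap between the two numbers would point at one-time initialisation rather than at the index being queried.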

[BUG] "POST /indexes/marqo-simplewiki-demo-all HTTP/1.1" 500 Internal Server Error

Describe the bug
Server error 500 trying to run the SimpleWiki example. Tried with 0.0.5 and 0.0.3 as the demo shows.

gllermaly@ubuntu-s-2vcpu-4gb-amd-nyc1-01:~/marqo/SimpleWiki$ python3 simple_wiki_demo.py
loaded data with 188557 entries
Traceback (most recent call last):
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/_httprequests.py", line 131, in __validate
request.raise_for_status()
File "/usr/lib/python3/dist-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8882/indexes/marqo-simplewiki-demo-all

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/gllermaly/marqo/SimpleWiki/simple_wiki_demo.py", line 64, in <module>
client.create_index(index_name, model='onnx/all_datasets_v4_MiniLM-L6')
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/client.py", line 62, in create_index
return Index.create(
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/index.py", line 78, in create
return req.post(f"indexes/{index_name}", body={
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/_httprequests.py", line 99, in post
return self.send_request(requests.post, path, body, content_type)
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/_httprequests.py", line 77, in send_request
return self.__validate(response)
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/_httprequests.py", line 134, in __validate
convert_to_marqo_error_and_raise(response=request, err=err)
File "/home/gllermaly/.local/lib/python3.10/site-packages/marqo/_httprequests.py", line 148, in convert_to_marqo_error_and_raise
raise MarqoWebError(message=response_msg, code=code, error_type=error_type,
marqo.errors.MarqoWebError: MarqoWebError: MarqoWebError Error message: {'message': "HTTPSConnectionPool(host='localhost', port=9200): Max retries exceeded with url: /marqo-simplewiki-demo-all (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f3bcd18cc10>: Failed to establish a new connection: [Errno 111] Connection refused'))", 'code': 'backend_communication_error', 'type': 'internal', 'link': ''}
status_code: 500, type: internal, code: backend_communication_error, link:

To Reproduce
Steps to reproduce the behavior:

Install marqo
Run examples/SimpleWiki

Expected behavior
Demo should work

Desktop (please complete the following information):

OS: Ubuntu 22 server DO fresh droplet
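
The traceback above ends in a refused connection to port 9200 on localhost, which suggests that the backend the server talks to was not reachable. A quick way to confirm this before creating an index is a plain TCP check. The helper name `port_is_open` is hypothetical; the two ports are the ones that appear in the traceback (8882 in the request URL, 9200 in the error message).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False


if __name__ == "__main__":
    for name, port in (("Marqo API", 8882), ("backend", 9200)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port} is {state}")
```

An open port only shows that something is listening there; it does not prove that the service behind it is healthy.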

Unsupported Docker Images for M1 (arm64)

I have an M1 MacBook Pro running macOS Monterey v12.5. It crashes when I try to install Marqo with the following Docker commands:

docker rm -f marqo
DOCKER_BUILDKIT=1 docker build . -t marqo_docker_0
docker run --name marqo --privileged -p 8000:8000 --add-host host.docker.internal:host-gateway marqo_docker_0

Error messages:

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Starting supervisor
starting dockerd command
dockerd command complete
Waiting for processes to be running
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Process dockerd is not running yet. Retrying in 1 seconds

failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to create NAT chain DOCKER: iptables failed: iptables -t nat -N DOCKER: iptables v1.8.4 (legacy): can't initialize iptables table 'nat': iptables who? (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.

The first error message originates from the run_marqo.sh file, and the second one is a docker error message.

[BUG] Getting started curl documentation is outdated/wrong

The curl getting-started guide says to use the /indices API, while the package itself uses /indexes.
Also, running the getting-started commands returns the wrong response: {"detail":"Not Found"}{"detail":"Not Found"}.
This is especially confusing when switching from the package to curl and vice versa.

incorrect results when using image from local disk

I am currently creating a demo application for marqo showcasing the multi-modal index feature.

The dataset I used is a bunch of clothing apparel (shirts, shorts, shoes, hats, etc), which can be found here:
Clothing Dataset

I was able to load the images from local disk by using the docker command:
docker run --name marqo --mount type=bind,source=/user/someone/images/,target=/user/someone/images/ --privileged -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:0.0.1

where I replaced the source directory /user/someone/images/ with the directory where the images are located,
and the target directory /user/someone/images/ with the directory where I want to save uploaded files.

When searching images posted via web URL, marqo works as intended.
Image link used: https://d1mcl5z4l1p8tu.cloudfront.net/media/catalog/product/cache/73f803a782a839317b5e9918c11efa7e/c/o/corneliani85g571-0125050-007-4.jpg

However when uploading the same picture from local disk, it returns incorrect results:

Query used for local directory:
'C:\Users\Vitus\Documents\Work\marqo\demo\corneliani85g571-0125050-007-4.jpg\'
(am using Windows)

[ENHANCEMENT] DigitalOcean one-click deploy

Is your feature request related to a problem? Please describe.
We should have a DigitalOcean "one click deploy" that lets users run Marqo on DigitalOcean. It would automatically set up a GPU instance with adequate storage and memory, and come with a guide that explains the costs and tradeoffs of the different options.

Describe the solution you'd like
Any user should be able to read the guide and easily set up Marqo on DigitalOcean using the "one click" setup.

Describe alternatives you've considered
none

Additional context
none

[ENHANCEMENT] Add support for open-clip models

Is your feature request related to a problem? Please describe.
Only the official CLIP models are supported by default. Adding support for open-clip would extend the capabilities.

Describe the solution you'd like
Add support for open-clip models.

Additional context
https://github.com/mlfoundations/open_clip

[BUG]

docker rm -f marqo;docker run --name marqo -it --privileged -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:0.0.3

Not working. Error: unknown flag: --name
