
Comments (7)

rthiiyer82 commented on July 2, 2024

Thank you so much for the details. This is good enough. Will get back to you soon.

from weaviate.

rthiiyer82 commented on July 2, 2024

@darshilshahquantive Thank you for bringing up this issue.

Would you be able to share a script that reproduces it?

darshilshahquantive commented on July 2, 2024

I don't have a single script, since several different applications use Weaviate rather than one script, but I can provide information on how we import the data, how we make search queries, and so on.

If you can send a list of questions about our architecture that you'd like answered, that would be very helpful!

rthiiyer82 commented on July 2, 2024

Hey @darshilshahquantive, if you could provide information on the class schemas, how you are importing objects, and the queries used for testing, that would be helpful!

darshilshahquantive commented on July 2, 2024

We have three products. For each one, I am providing the functions we use for defining the schema, importing objects, and retrieving objects.

PRODUCT 1
Class Schemas

The schema for a document class in Weaviate is defined as follows:

json

{
  "class": "Document",
  "properties": [
    { "name": "title", "dataType": ["string"] },
    { "name": "content", "dataType": ["text"] },
    { "name": "account_id", "dataType": ["string"] }
  ],
  "multiTenancyConfig": { "enabled": true },
  "vectorizer": "text2vec-openai",
  "moduleConfig": {
    "text2vec-openai": {
      "resourceName": "openai-resource",
      "deploymentId": "text-embedding-ada-002"
    }
  }
}
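
For reference, here is a minimal sketch of registering this schema idempotently with the v3 Python client. `ensure_class` is a hypothetical helper name (not part of our codebase); `client` would be a `weaviate.Client` instance:

```python
# Schema copied from above; assumes the v3 weaviate-client API
# (client.schema.get / client.schema.create_class).
DOCUMENT_SCHEMA = {
    "class": "Document",
    "properties": [
        {"name": "title", "dataType": ["string"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "account_id", "dataType": ["string"]},
    ],
    "multiTenancyConfig": {"enabled": True},
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "resourceName": "openai-resource",
            "deploymentId": "text-embedding-ada-002",
        }
    },
}

def ensure_class(client, schema: dict) -> bool:
    """Create the class only if it does not already exist; True if created."""
    existing = {c["class"] for c in client.schema.get().get("classes", [])}
    if schema["class"] in existing:
        return False
    client.schema.create_class(schema)
    return True
```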

Importing Objects

Objects are imported using the insert_chunked_data_into_vector_store function:

python

def insert_chunked_data_into_vector_store(chunks, uploaded_file_name, account_id, document_id):
    vector_store = get_raw_vector_store()
    WeaviateUtility.add_documents(
        vector_store,
        chunks,
        StrategyConfig.Weaviate.VS_MULTI_TENANCY_CLASS_NAME,
        account_id,
        uploaded_file_name,
        document_id,
    )

Query Used for Testing

A function for fetching relevant documents based on text:

python

def get_relevant_documents_for_text(account_id, text, document_ids):
    vs = get_raw_vector_store()
    query = (
        vs.query.get(StrategyConfig.Weaviate.VS_MULTI_TENANCY_CLASS_NAME, ["content", "source_file", "document_id"])
        .with_tenant(account_id)
        .with_hybrid(query=text, properties=["content"])
        .with_additional('rerank(property: "content" query: "' + text + '") { score }')
        .with_limit(5)
        .with_where({"path": ["document_id"], "operator": "ContainsAny", "valueTextArray": document_ids})
    )
    return query.do()
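
One note on the rerank clause above: concatenating `text` straight into the GraphQL string can produce an invalid query if the text contains a double quote or backslash. A minimal escaping helper (hypothetical name, not in our codebase) is one way to guard against that:

```python
def escape_graphql_string(s: str) -> str:
    # Backslashes must be escaped first, then double quotes, so the value
    # can be embedded safely inside a double-quoted GraphQL string literal.
    return s.replace("\\", "\\\\").replace('"', '\\"')

# Usage in the query above:
# .with_additional('rerank(property: "content" query: "'
#                  + escape_graphql_string(text) + '") { score }')
```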

Explanation

Class Schemas: Defines the structure and properties of the data stored in Weaviate, including multi-tenancy and vectorizer configurations.
Importing Objects: Uses the Weaviate client to insert chunked data into the vector store.
Query: Fetches documents relevant to a given text, using a hybrid query with additional re-ranking based on content.

PRODUCT 2
Class Schemas

Example schema for class:

json

{
  "class": "TrackBC",
  "multiTenancyConfig": {"enabled": true},
  "vectorizer": "none",
  "properties": [
    {"name": "work_item_id", "dataType": ["text"]},
    {"name": "dataset_ids", "dataType": ["text[]"]},
    {"name": "external_id", "dataType": ["text"]},
    {"name": "external_key", "dataType": ["text"]},
    {"name": "external_system_url", "dataType": ["text"]},
    {"name": "name", "dataType": ["text"]},
    {"name": "description", "dataType": ["text"]},
    {"name": "suggested_description", "dataType": ["text"]},
    {"name": "status", "dataType": ["text"]},
    {"name": "type", "dataType": ["text"]},
    {"name": "assignee_emails", "dataType": ["text[]"]},
    {
      "name": "effort",
      "dataType": ["object"],
      "nestedProperties": [
        {"dataType": ["number"], "name": "amount"},
        {"dataType": ["text"], "name": "unit"},
        {"dataType": ["text"], "name": "description"}
      ]
    },
    {"name": "hash", "dataType": ["text"]}
  ]
}

Importing Objects

Objects are imported using the add_single_object function:

python

def add_single_object(client: weaviate.Client, class_name: str, tenant: str, uuid: str, vector: list, object_data: Dict[str, Any]) -> None:
    tenants = get_existing_tenants(client, class_name)
    if tenant in tenants:
        valid = client.data_object.validate(class_name=class_name, data_object=object_data, uuid=uuid, vector=vector)
        if valid["valid"]:
            try:
                client.data_object.create(class_name=class_name, data_object=object_data, tenant=tenant, uuid=uuid, vector=vector)
            except weaviate.ObjectAlreadyExistsException:
                logger.info(f"Object {uuid} already exists in weaviate")
                raise
            except Exception as e:
                logger.error(f"Error while adding weaviate object: {e}")
                raise
        else:
            logger.error(f"Weaviate Object {object_data} is not valid")
            raise Exception(f"Weaviate Object {object_data} is not valid")
    else:
        logger.error(f"Tenant {tenant} does not exist for class {class_name}")
        raise Exception(f"Tenant {tenant} does not exist for class {class_name}")

Query Used for Testing

Retrieving data in batches:

python

def get_data_in_batch(client: weaviate.Client, class_name: str, tenant: str, where_filter: Dict, properties: List = [], batch_size: int = 200) -> List[Dict[str, Any]]:
    count = get_count_of_objects(client, class_name, tenant, where_filter)
    weaviate_data = []
    for i in range(0, count, batch_size):
        try:
            response = (
                client.query.get(class_name, properties)
                .with_tenant(tenant)
                .with_where(where_filter)
                .with_additional(["vector id"])
                .with_limit(batch_size)
                .with_offset(i)
                .do()
            )

            if response.get("errors"):
                logger.error(f"Error while getting objects in batch: {response['errors']}")
                raise Exception(f"Error while getting objects in batch: {response['errors']}")
            else:
                objects_list = response["data"]["Get"][class_name]
        except Exception as e:
            logger.error(f"Error while getting weaviate data in batch: {e}")
            raise

        weaviate_data.extend(objects_list)
    return weaviate_data
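
A side note on the batching above: offset-based paging re-scans the skipped objects on every page and is capped by QUERY_MAXIMUM_RESULTS. For large exports, Weaviate's cursor API (`with_after` on the v3 client) avoids both; a hedged sketch (`get_data_with_cursor` is a hypothetical name, and note the cursor cannot be combined with a where filter):

```python
def get_data_with_cursor(client, class_name, tenant, properties, batch_size=200):
    # Pages through all of a tenant's objects using the cursor API instead
    # of limit/offset; each page resumes after the last object's UUID.
    results, cursor = [], None
    while True:
        query = (
            client.query.get(class_name, properties)
            .with_tenant(tenant)
            .with_additional(["id"])
            .with_limit(batch_size)
        )
        if cursor is not None:
            query = query.with_after(cursor)
        page = query.do()["data"]["Get"][class_name]
        if not page:
            return results
        results.extend(page)
        cursor = page[-1]["_additional"]["id"]
```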

Explanation

Class Schemas: Defines the structure and properties for data stored in Weaviate, including multi-tenancy and nested properties.
Importing Objects: Uses the Weaviate client to validate and insert data objects into the vector store, ensuring that the correct tenant exists.
Query: Retrieves data in batches from Weaviate based on a filter, useful for testing and verifying the number of objects.

PRODUCT 3
Class Schemas

Class schemas are created dynamically based on the properties provided when creating the collection:

python

def _use_collection(self, collection_name: str, properties: List[str], non_vectorized_properties: List[str] = []) -> None:
    existing_collections = self.client.schema.get()["classes"]
    collection_names = [collection["class"] for collection in existing_collections]
    capitalized_collection_name = collection_name.capitalize()
    
    if capitalized_collection_name not in collection_names:
        class_obj = {
            "class": capitalized_collection_name,
            "description": "A collection of documents",
            "multiTenancyConfig": {"enabled": True},
            "vectorizer": "text2vec-openai",
            "properties": [
                {
                    "name": p.lower(),
                    "description": "string",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": p in non_vectorized_properties
                        }
                    },
                } for p in properties
            ],
            "moduleConfig": {
                "text2vec-openai": {
                    "model": "ada",
                    "modelVersion": "002",
                    "type": "text",
                }
            }
        }
        self.client.schema.create_class(class_obj)
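
One caveat worth flagging in the dynamic schema code: `str.capitalize()` lowercases every character after the first, so mixed-case collection names are silently renamed. A quick illustration, with a gentler transform if the intent is only to satisfy Weaviate's requirement that class names start with an uppercase letter (`to_class_name` is a hypothetical helper):

```python
# str.capitalize() uppercases the first character and lowercases the rest,
# so two distinct intended names can collapse into the same Weaviate class.
print("MyDocs".capitalize())   # Mydocs
print("my_docs".capitalize())  # My_docs

def to_class_name(name: str) -> str:
    # Only uppercase the first character; preserve the rest as-is.
    return name[:1].upper() + name[1:]

print(to_class_name("myDocs"))  # MyDocs
```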

Importing Objects

Checking document existence:

python

def _document_exist(self, collection_name: str, doc_id: str, namespace: str = None) -> bool:
    try:
        doc = (
            self.client.data_object.get_by_id(
                doc_id, class_name=collection_name.capitalize(), tenant=namespace
            )
            is not None
        )
        return doc
    except Exception:
        return False

Adding a single object:

python

def _add_single_object_to_collection(self, document: Dict, collection_name: str, doc_id: str, overwrite: bool, namespace: str) -> bool:
    capitalized_collection_name = collection_name.capitalize()
    doc_exist = self._document_exist(capitalized_collection_name, doc_id, namespace)
    if not doc_exist or (doc_exist and overwrite):
        properties = {k.lower(): v for k, v in document.items()}
        vec = None
        self.client.data_object.create(
            data_object=properties,
            class_name=capitalized_collection_name,
            uuid=doc_id,
            tenant=namespace,
            vector=vec,
        )
        return True
    return False

Adding multiple documents:

python

def _add_documents(self, documents: List[Dict], collection_name: str, namespace: str, overwrite: bool = False) -> None:
    for doc in documents:
        doc_id = str(doc["doc_id"])
        self._add_single_object_to_collection(doc, collection_name, doc_id, overwrite, namespace)
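
Since `_add_documents` issues one HTTP request per object, a hedged alternative using the v3 client's batching could cut import time considerably. This sketch assumes a weaviate-client version whose batch API accepts a `tenant` argument (roughly v3.20+), and note it skips the exists/overwrite check; with deterministic UUIDs, recent Weaviate versions overwrite batch-imported objects by UUID:

```python
def add_documents_batched(client, documents, collection_name, namespace, batch_size=100):
    # Uses the v3 client's batch context manager; objects are flushed to
    # Weaviate in groups of batch_size rather than one request per object.
    class_name = collection_name.capitalize()
    client.batch.configure(batch_size=batch_size)
    with client.batch as batch:
        for doc in documents:
            batch.add_data_object(
                data_object={k.lower(): v for k, v in doc.items()},
                class_name=class_name,
                uuid=str(doc["doc_id"]),
                tenant=namespace,
            )
```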

Queries Used for Retrieval

Constructing and executing queries:

python

def _get_documents(self, collection_name: str, query: str, num_results: int, filters: Dict = None, query_projections: List[str] = ["*"], query_properties: List[str] = ["*"], is_hybrid: bool = False, alpha: float = 0.5, namespace: str = None) -> List[Dict]:
    capitalized_collection_name = collection_name.capitalize()
    search = self.client.query.get(capitalized_collection_name, query_projections)
    
    if filters is not None:
        search = search.with_where(filters)
    
    if is_hybrid:
        raw_response = (
            search.with_hybrid(query=query, alpha=alpha, properties=query_properties)
            .with_additional("score")
            .with_limit(num_results)
            .with_tenant(namespace)
            .do()
        )
    else:
        raw_response = (
            search.with_near_text({"concepts": [query]})
            .with_additional("score")
            .with_limit(num_results)
            .with_tenant(namespace)
            .do()
        )

    documents = [
        {"score": item["_additional"]["score"], **item} for item in raw_response["data"]["Get"][capitalized_collection_name]
    ]
    return documents

We also use "And" operators for filtering documents, both to retrieve the top matches and to delete via a filter:

python

def _retrieve_top_distinct_values(
    self,
    query: str,
    database_name: str,
    schema_name: str,
    table_name: str,
    column_name: str,
    namespace: str,
    num_results=3,
) -> list[str]:
    """
    Retrieve the top distinct values for the given column and query.
    """
    filter = {
        "operator": "And",
        "operands": [
            {
                "path": ["database_name"],
                "operator": "Equal",
                "valueText": database_name,
            },
            {
                "path": ["schema_name"],
                "operator": "Equal",
                "valueText": schema_name,
            },
            {
                "path": ["table_name"],
                "operator": "Equal",
                "valueText": table_name,
            },
            {
                "path": ["column_name"],
                "operator": "Equal",
                "valueText": column_name,
            },
        ],
    }

    distinct_documents = self._get_documents(
        query=query,
        collection_name="distinct_column_values",
        num_results=num_results,
        namespace=namespace,
        filters=filter,
    )

    distinct_values = [doc["value"] for doc in distinct_documents]
    return distinct_values

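
Since the same Equal-operand pattern recurs for every column, a small helper (hypothetical, not in our codebase) could build these And filters from a mapping:

```python
def build_and_filter(equals: dict) -> dict:
    # Builds a Weaviate where-filter that ANDs one Equal clause per field.
    return {
        "operator": "And",
        "operands": [
            {"path": [path], "operator": "Equal", "valueText": value}
            for path, value in equals.items()
        ],
    }

# e.g. build_and_filter({"database_name": "db", "table_name": "users"})
```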

Summary

Class Schemas: Defined dynamically when creating collections.
Importing Objects: Functions check for document existence and then add documents to the collection.
Retrieval Queries: Constructed to perform hybrid or near-text searches, with optional filters and projections.

I hope this helps. Please feel free to ask if anything is unclear.

darshilshahquantive commented on July 2, 2024

Also @rthiiyer82, I have pprof profile logs as well, if they can help in any way:

File: weaviate
Type: inuse_space
Time: Jun 10, 2024 at 7:31am (UTC)
Showing nodes accounting for 16638.89MB, 96.41% of 17258.55MB total
Dropped 406 nodes (cum <= 86.29MB)
flat flat% sum% cum cum%
13244.84MB 76.74% 76.74% 13244.84MB 76.74% github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw/distancer.Normalize (inline)
1466.63MB 8.50% 85.24% 1466.63MB 8.50% github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw.(*Deserializer).ReadLink
418.43MB 2.42% 87.67% 418.43MB 2.42% github.com/weaviate/weaviate/adapters/repos/db/vector/cache.(*shardedLockCache[go.shape.float32]).Grow

While the actual heap usage is approximately 19.5 GB according to our monitoring metrics, the profiler reports approximately 17.3 GB.
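
The `Normalize` frames dominating the heap suggest most of that memory is the in-memory vector cache (normalized float32 copies of the vectors for cosine distance), with the `ReadLink` frames being HNSW graph links on top. A rough back-of-the-envelope estimate, where the 2.3M object count and 1536 dimensions (ada-002) are illustrative assumptions, not measured values:

```python
def estimate_vector_cache_gib(num_objects: int, dims: int, bytes_per_float: int = 4) -> float:
    # Each cached vector costs dims * 4 bytes as float32; HNSW graph links
    # and other per-object overhead come on top of this figure.
    return num_objects * dims * bytes_per_float / 1024**3

# e.g. ~2.3M objects with 1536-dim embeddings:
print(round(estimate_vector_cache_gib(2_300_000, 1536), 1))  # 13.2
```

If the cache is indeed the driver, Weaviate's `vectorIndexConfig.vectorCacheMaxObjects` setting is the usual knob for bounding it, at the cost of query latency for uncached vectors.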

from weaviate.

darshilshahquantive commented on July 2, 2024

@rthiiyer82 any updates on this?

By the way, I am also seeing continuous logs like the one below when the memory spike increases, and as soon as Weaviate restarts these logs seem to go away (I enabled the GCTRACE logs). Could this affect the memory spikes by any chance?

{"action":"lsm_memtable_flush","class":"Account_accountsg_categories","error":"switch active memtable: init commit logger: open /var/lib/weaviate/accountaccountsg_categories//lsm/objects/segment-1718480924736645345.wal: no such file or directory","index":"accountaccountsg_categories","level":"error","msg":"flush and switch failed","path":"/var/lib/weaviate/accountaccount_sg_categories//lsm/objects","shard":"","time":"2024-06-15T19:48:44Z"}
