nomic-ai / nomic Goto Github PK

View Code? Open in Web Editor NEW

1.0K 24.0 137.0 24.52 MB

Interact, analyze and structure massive text, image, embedding, audio and video datasets

Home Page: https://atlas.nomic.ai

Makefile 0.69% Python 99.31%

python clustering duplicate-detection embeddings text topic-modeling unstructured-data

nomic's People

Stargazers

Watchers

Forkers

techthiyanes griseljimenez apjanco davidnmora yesthing vcip2015 dorucioclea guoqiangjia phi-line lsmz23 nkaligin nurb432 octaviusgm popandepo hightylergreen tylerai yuanzhanghu wiwomu m4cs jastworld bsef19m520 hendrikmax cibuildorg parweztahseen gurprreetsingh eltociear iamdarklion techventurebuilder ashok969 tanzeelrana hhy5277 parisneo wlinds tvytlx haipayaxi puppergump kjbkjb0712 newmedia2 guhjy aryanutkarsh ljmat i-z-z-y abaso007 komolapok lambertcsy marcostolosa tsciubba markac007 srikalyan muhamadgomaa john-codes mikeamcbrien jfoggo comsa33 nragan34 llaith-ai 0xo0o0 4z0t karthikashwath tahabinhuraib ginko-ai athletic-geek hero echallenge selad kukambas akashmavle5 caiodavic tynorth hobbesfnm mknw new4u macguyversmusic goodwill-ken gmh5225 eqeiland nbdy henryhesz chugarah ayuryshev jarekmor carlosd1111 erickwill xang1234 gobbletown ethicalsecurity-agency bhctest123 taskers1 martyyz2112 brainhub24 mz0in sorokinvld clarajzt hartswf0 mike100101100011 ya117 apollohuang1 ntrystn automationkit myownipgit

nomic's Issues

Deleted projects seem to hold their namespace

If I create a project, delete it using project.delete(), and then immediately create a new project with the same name, I get a retrieval error--not sure if this is caching on the web API or what.

topics / labels

I could not find in the Documentation, how labels / topics are created. Automatically? Based on the indexed_field? Can you pls point me to the right source? Or should I create a separate field, labeling each 'text'? Thanks for your great product !

List projects under my account

Hey, is there a method to list all projects under my Nomic account?
Thanks!

Add OpenAI embedding integration

This should be structured similarly to the existing Cohere integration.
https://beta.openai.com/docs/guides/embeddings

Retrieving project with different options should check that contraints are actually true upstream

If you call get_project with a bunch of options, it will retrieve the exists project if it exists. BUT it should also check that the parameters you've passed are valid. Otherwise you can do something like

get_project('foo', unique_id = 'gluck')

and then in a later session

get_project('foo', unique_id = 'bar')

and the code incorrectly asserts an id field the project will not use.

Project bar is ugly in jupyter notebook

Rather than update the pbar in place, it draws a stairway leading down the hell of a bad UX.

Feature request: stdout callback

Hello; I'm currently writing a web UI for the ChatGPT4All. I've faced an issue where the only way for me to get output is to consume it at end. I'd like to request a change which will allow to pass a stdout callback function - which might be done without breaking backwards compatibility.

It could be done similar to this: https://github.com/Venthe/chatgpt4all-webui/blob/main/server/prompt_parser.py

I've also added ID there but it is of co consequence, as it can be a part of the callback. Same thing with type, as I've ended up doing it indempotently.

Proposal for the callback signature:
'''python
def callback(responseCharacter)
'''

Where the default could be as it is now, sys.stdout.print

Your authorization token is no longer valid. Run `nomic login` to obtain a new one.

I keep getting this error message when I try to run the basic example in atlas document in jupyter notebook.
I run the following code cell by cell:
!nomic login
!nomic login [token i got]
(no error here)

import numpy as np

num_embeddings = 10000
embeddings = np.random.rand(num_embeddings, 256)

project = atlas.map_embeddings(embeddings=embeddings)

then I get this error message:

ValueError Traceback (most recent call last)
/tmp/ipykernel_1642635/3339716498.py in
6 embeddings = np.random.rand(num_embeddings, 256)
7
----> 8 project = atlas.map_embeddings(embeddings=embeddings)

~/anaconda3/envs/lyx/lib/python3.8/site-packages/nomic/atlas.py in map_embeddings(embeddings, data, id_field, name, description, is_public, colorable_fields, build_topic_model, topic_label_field, num_workers, organization_name, reset_project_if_exists, add_datums_if_exists, shard_size, projection_n_neighbors, projection_epochs, projection_spread)
82 } for _ in range(len(embeddings))]
83
---> 84 project = AtlasProject(
85 name=project_name,
86 description=description,

~/anaconda3/envs/lyx/lib/python3.8/site-packages/nomic/project.py in init(self, name, description, unique_id_field, modality, organization_name, is_public, project_id, reset_project_if_exists, add_datums_if_exists)
861
862 if organization_name is None:
--> 863 organization_name = self._get_current_users_main_organization()['nickname']
864
865 results = self._get_existing_project_by_name(project_name=name, organization_name=organization_name)

~/anaconda3/envs/lyx/lib/python3.8/site-packages/nomic/project.py in _get_current_users_main_organization(self)
122 '''
123
--> 124 user = self._get_current_user()
...
---> 99 raise ValueError("Your authorization token is no longer valid. Run nomic login to obtain a new one.")
100
101 return response.json()

ValueError: Your authorization token is no longer valid. Run nomic login to obtain a new one.

I will appreciate if you can provide some insight or solution to this problem!

Visualize non-Euclidean embeddings

I'm trying to visualize 128 dimensional embeddings in hyperbolic space using Atlas. However, I noticed that Atlas's create_index function includes the line

'nearest_neighbor_index_hyperparameters': json.dumps({'space': 'l2', 'ef_construction': 100, 'M': 16})

in the build_template here. This line uses the l2 distance while I have to use the hyperbolic distance function as my embeddings are in hyperbolic space. I know your dimensionality reduction algorithm is closed source, but I was wondering

Is there a way to specify a custom distance function when creating embeddings? This could be a helpful feature for users looking for more customization.
Is the l2 space used when performing dimensionality reduction or is it perhaps only used for nearest neighbor search?

Any answers to this would be greatly appreciated. Thanks!

Remove GPT4All from repository safely.

We have over 100 dependencies. We need to do some form of safe migration away.

I should be able to run `p.create_index` even if I have no data yet

User feedback over e-mail. Filing does not imply endorsement.

I should be able to run p.create_index even if I have no data yet: a user of the library can’t anticipate this and it might unexpectedly throw in prod.

Hardcoded lora path in GPT4AllGPU prevents loading of models

The Python class GPT4AllGPU relies on a hard-coded lora path:
self.lora_path = 'nomic-ai/vicuna-lora-multi-turn_epoch_2'

The referenced lora appears to be no longer available on Huggingface.

Please consider changing this hard-coded path to one that is user-defined and updating the documentation so that future users know what the GPU class expects.

Fetch from Arrow endpoints

Methods for fetching from arrow endpoints.

No windows binaries for GPT4All

Hi everyone,

I was trying to test the GPT4All on My PC, and was faced with the problem that it doesn't have a binary version of windows :

Your platform is not supported: Windows. Current binaries supported are x86 Linux and ARM Macs.

Is there a way, a windows version coming soon?
I think many people around the world use this platform so it would be helpful not to be forced to install WSL to do this.

Best regards

AtlasProject.total_datums doesn’t update when you run add_embeddings.

User feedback. Probably lacking a call to update state.

unclear error message

Noting an error here, will investigate later.

Getting a bad error on the main branch on a second upload. Rather than listing the conflicting ids, it is printing the string set().

  0%|          | 0/1 [00:00<?, ?it/s]2023-02-12 13:48:52.081 | ERROR    | nomic.project:add_text:1059 - Shard upload failed: {'detail': 'Insert failed due to ID conflict. Conflicting IDs: set()'}

Export atlas view as image

I have been playing around with Nomics API and I was wondering if it is possible to export a particular view (at least the general view) of an atlas as PNG or other image format.

project reports having no fields after a successful upload.

Expected behavior: it lists all the fields in the data.

Problems with browser client features

I'm trying to play with Atlas, and I'm having trouble with a lot of features (in the "bells and whistles" sense) on my dataset:

Filtering by "correct" property is strange.

I have a "correct" property that is a boolean. Since Atlas doesn't support booleans, I converted it to an integer. The behavior when filtering on this field is odd. If you move the upper range below 1, no points display, ever. If you move the lower range above 0, sometimes you get points, and sometimes the space is blank. There are points with correct=0 and correct=1 (at least, there should be -- there are before I uploaded them to Atlas).

Searching by "correct" property

No matter what I try to search for, nothing happens when I try to search. You can't tell from the video, but I'm hitting enter between all of these.

Only "Dirname" shows up as an option for coloring, but I added Filename and correct as colorable fields as well. Maybe correct can't be used because it's an integer? But Filename is a string.

Any help would be appreciated. This seems like a neat project.

Automatically infer colorable_fields on create_index

Asking people to pass colorable_fields is a little annoying; if not passed, we could infer colorability based on the following rules.

If it's a float or an int, include it.
If it's a date, include it.
If it's a string, check the cardinality; include as a likely categorical if it has lots of repeats. (note--this is the tricky one, because the cardinality of previously uploaded data isn't necessarily handy when a user creates an index.)

pip install nomic on Windows: error in nomic setup command: 'extras_require' must be a dictionary

See the issue posted in GPT4All repo

Client side data validation for colorable fields

Ran into cryptic error message (Exception: [{'loc': ['body', 'colorable_fields', 0], 'msg': 'str type expected', 'type': 'type_error.str'}]) while uploading my colorable fields like below. Turns out the colorable_fields variable was a tuple on my end, and this was accepted by the function and then gave back the error message above.

project = AtlasProject(name=name, unique_id_field="id",reset_project_if_exists=False,modality='text')
project.add_text(sample_df.iloc[:1000])
colorable_fields = ["timestamp","sentiment","score",'subreddit.name'],
indexedField = 'body'
project.create_index(name, 
                    indexed_field=indexedField, 
                    colorable_fields = colorable_fields,
                    multilingual=False, 
                    build_topic_model=True)

add tooltips on https://atlas.nomic.ai/map/*

so useful thanks

Raise more informative better errors, change some warnings to errors.

User feedback.

I feel you should raise an error instead of only logging something if I’ve run out of storage. In prod I need to abort if I’ve hit the limit or I might end up serving a db with only half the points in. This also happens if I try to upload values which have conflicting ids. At the very least don’t log ‘Upload succeeded’ when it didn't

504 error Unable to create index for AtlasProject

import random
from datasets import load_dataset

def stream_data(T=5000):
    safe = load_dataset('allenai/soda')['train']
    unsafe = iter(
        [e for e in load_dataset('allenai/prosocial-dialog')['train']
         if e['safety_label'] == '__needs_intervention__']
    )
    
    for t in range(T):
        p_unsafe = t/(3*T) #slowly increase toxic content to 33% chance
        is_unsafe = random.uniform(0, 1) <= p_unsafe
    
        if is_unsafe:
            x = next(unsafe)
            yield 'unsafe', t, x['context'] + ' ' + x['response']
        else:
            x = safe[t]
            yield 'safe', t, x['dialogue'][0]

data = [e for e in stream_data(6000)]
train_data = data[:5000]
test_data = data[5000:]

import time
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from nomic import atlas, AtlasProject

import torch
from torch import device
# Create a SentenceTransformer object to vectorize our text
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device('cuda'))

batch_size = 1000
batched_texts = []
batched_metadatas = []

# Create a project instance with the 'embedding' modality
project = AtlasProject(name='Chat analysis', unique_id_field='id', modality='embedding')

# Inserting data into atlas
for i, row in enumerate(tqdm(train_data)):
    label, timestamp, text = row[0], row[1], row[2]

    #Batch data for faster adds. You can also add data one at a time if you like
    batched_texts.append(text)
    batched_metadatas.append({'id': i, 'label': label, 'timestamp': timestamp})

    if len(batched_texts) >= batch_size:
        # Generate embeddings
        embeddings = model.encode(batched_texts)

        # Upload embeddings and metadata to Nomic
        project.add_embeddings(
            embeddings=np.array(embeddings),
            data=batched_metadatas,
        )

        # clean up batch
        batched_texts = []
        batched_metadatas = []

        #Wait for the index to build
        time.sleep(1)

%%time
# Create an index and build a topic model
project.create_index(name=project.name, build_topic_model=True, topic_label_field='text')

>>>
Create project failed with code: 504
2023-06-15 15:45:20.109 | INFO     | nomic.project:create_index:1401 - Additional info: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>504 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
CloudFront attempted to establish a connection with the origin, but either the attempt failed or the origin closed the connection.

JSONDecodeError                           Traceback (most recent call last)
File <timed eval>:2

File ~\anaconda3\envs\speech2text\lib\site-packages\nomic\project.py:1402, in AtlasProject.create_index(self, name, indexed_field, colorable_fields, multilingual, build_topic_model, projection_n_neighbors, projection_epochs, projection_spread, topic_label_field, reuse_embeddings_from_index, duplicate_detection, duplicate_threshold)
   1400     logger.info('Create project failed with code: {}'.format(response.status_code))
   1401     logger.info('Additional info: {}'.format(response.text))
-> 1402     raise Exception(response.json()['detail'])
   1404 job_id = response.json()['job_id']
   1406 job = requests.get(
   1407     self.atlas_api_path + f"/v1/project/index/job/{job_id}",
   1408     headers=self.header,
   1409 ).json()

File ~\anaconda3\envs\speech2text\lib\site-packages\requests\models.py:975, in Response.json(self, **kwargs)
    971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

When I check back my dashboard, I repeatedly get logged out.

Rerunning the same lines previously within my Jupyter Notebook also returns me a different error message afterward, which I presume to be a credential issue related.

project.create_index(name=project.name, build_topic_model=True, topic_label_field='text')

>>>
2023-06-15 15:55:08.356 | INFO     | nomic.project:create_index:1400 - Create project failed with code: 400
2023-06-15 15:55:08.357 | INFO     | nomic.project:create_index:1401 - Additional info: {"detail":"Topic model hyperparameters are invalid"}
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
File <timed eval>:2

File ~\anaconda3\envs\speech2text\lib\site-packages\nomic\project.py:1402, in AtlasProject.create_index(self, name, indexed_field, colorable_fields, multilingual, build_topic_model, projection_n_neighbors, projection_epochs, projection_spread, topic_label_field, reuse_embeddings_from_index, duplicate_detection, duplicate_threshold)
   1400     logger.info('Create project failed with code: {}'.format(response.status_code))
   1401     logger.info('Additional info: {}'.format(response.text))
-> 1402     raise Exception(response.json()['detail'])
   1404 job_id = response.json()['job_id']
   1406 job = requests.get(
   1407     self.atlas_api_path + f"/v1/project/index/job/{job_id}",
   1408     headers=self.header,
   1409 ).json()

Exception: Topic model hyperparameters are invalid

do you plan to provide qdrant vector database sample

Could you provide the Qdrant sample?

Error occurs when downloading tiles.

Use this from the site tb = project.indices[0].projections[0].web_tile_data(overwrite=True) but got this error
AttributeError: 'AtlasProjection' object has no attribute 'web_tile_data'. Did you mean: '_tile_data'?

`df` property of AtlasMapTopics does not include topic ids, only topic labels

    @property
    def df(self) -> pandas.DataFrame:
        """
        A pandas dataframe associating each datapoint on your map to their topics as each topic depth.
        """
        return self.tb.to_pandas()

    @property
    def tb(self) -> pa.Table:
        """
        Pyarrow table associating each datapoint on the map to their Atlas assigned topics.
        This table is memmapped from the underlying files and is the most efficient way to
        access topic information.
        """
        return self._tb

print(topic_data.df[0:1])

      id topic_depth_1       topic_depth_2  topic_depth_3
0  18963  Music videos  Youtube, bitchute,  youtube video

UnicodeEncodeError

Trying to construct a text map using atlas, it can work when I sample a small number of data. But when I use all the data, it got errors:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[8], line 17
     13 print(data[0])
     15 max_documents = 500000
---> 17 project = atlas.map_text(data=data,
     18                           indexed_field='dialogue',
     19                           name='UltraChat',
     20                           id_field='id',
     21                           description='Large-scale, high-quality, and diverse muli-round dialogue data.',
     22                           )

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/nomic/atlas.py:226, in map_text(data, indexed_field, id_field, name, description, build_topic_model, multilingual, is_public, colorable_fields, num_workers, organization_name, reset_project_if_exists, add_datums_if_exists, shard_size, projection_n_neighbors, projection_epochs, projection_spread)
    224         logger.info(f"{project.name}: Deleting project due to failure in initial upload.")
    225         project.delete()
--> 226     raise e
    228 logger.info("Text upload succeeded.")
    230 # make a new index if there were no datums in the project before

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/nomic/atlas.py:218, in map_text(data, indexed_field, id_field, name, description, build_topic_model, multilingual, is_public, colorable_fields, num_workers, organization_name, reset_project_if_exists, add_datums_if_exists, shard_size, projection_n_neighbors, projection_epochs, projection_spread)
    216     logger.warning("Passing 'num_workers' is deprecated and will be removed in a future release.")
    217 try:
--> 218     project.add_text(
    219         data,
    220         shard_size=None,
    221     )
    222 except BaseException as e:
    223     if number_of_datums_before_upload == 0:

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/nomic/project.py:1387, in AtlasProject.add_text(self, data, pbar, shard_size, num_workers)
   1385     data = pa.Table.from_pandas(data)
   1386 elif isinstance(data, list):
-> 1387     data = pa.Table.from_pylist(data)
   1388 elif not isinstance(data, pa.Table):
   1389     raise ValueError("Data must be a pandas DataFrame, list of dictionaries, or a pyarrow Table.")

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:3705, in pyarrow.lib.Table.from_pylist()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:5226, in pyarrow.lib._from_pylist()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:3580, in pyarrow.lib.Table.from_arrays()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:1391, in pyarrow.lib._sanitize_arrays()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:1372, in pyarrow.lib._schema_from_arrays()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2350-2351: surrogates not allowed

I have removed all the non-ASCII data but still got the error, any ideas?

Docs layout is broken on safari

User feedback over e-mail

Failure when two columns have same name but in upper- and lower-case

If I upload a table that has columns called ID and id, everything appears to work fine but then the topic builds break.

nomic should raise an error if a user tried to upload columns with different names.

Improve state handling on deletion

if I run

project.delete()

wait a minute, and then run

project.total_datums

I get the number of datums in the project before it was deleted. That's not right. Upon deletion, projects should ideally transition to some state where they raise errors on most methods.

AttributeError: 'AtlasMapEmbeddings' object has no attribute 'atlas_api_path'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[17], line 1
----> 1 for e in map.embeddings.get_embedding_iterator():
      2     print(e)
      3     break

File ~/Minds/projects/topical/nomic/lib/python3.10/site-packages/nomic/data_operations.py:509, in AtlasMapEmbeddings.get_embedding_iterator(self)
    506 limit = EMBEDDING_PAGINATION_LIMIT
    507 while True:
    508     response = requests.get(
--> 509         self.atlas_api_path
    510         + f"/v1/project/data/get/embedding/{self.project.id}/{self.projection.atlas_index_id}/{offset}/{limit}",
    511         headers=self.header,
    512     )
    513     if response.status_code != 200:
    514         raise Exception(response.text)

AttributeError: 'AtlasMapEmbeddings' object has no attribute 'atlas_api_path'

ImportError: cannot import name 'GPT4AllGPU' from 'nomic.gpt4all'

There is no reference for the class GPT4ALLGPU on the file nomic/gpt4all/init.py

After adding the class, the problem went away.

For anyone with this problem, just make sure you init file looks like this:

from nomic.gpt4all import GPT4All, GPT4AllGPU, prompt
Edit: Typo. Thanks for bringing that to my attention @mir-ashiq !

Demo colab error: AttributeError: 'AtlasProjection' object has no attribute 'vector_search'

Trying to re-run the colab demo, and running the following snippet

# Now perform similarity search over the map!
map = project.maps[0]
with project.wait_for_project_lock():
  neighbors, _ = map.vector_search(ids=[0], k=5)

#print the 5 most similar datapoints to the data point with id = 0 (including the point with id=0)
similar_datapoints = project.get_data(ids=neighbors[0])
for point in similar_datapoints:
  print(point)

yields the following error:

AttributeError: 'AtlasProjection' object has no attribute 'vector_search'

I haven't edited the notebook, so not sure why the demo isn't working.

Not raising value error if id field is too long.

I'm getting 'pyarrow.lib.ChunkedArray' object has no attribute 'utf8_length' at this line instead of it raising the error on the next line when I have an id field that is too long.

nomic/nomic/project.py

Line 257 in 4af847f

    
           first_match = data.filter(data[project.id_field].utf8_length() > 36).to_pylist()[0]

mwe

# nomic 1.1.6
# python 3.10.10
# pyarrow 11.0.0
from nomic import atlas
import numpy as np

project = atlas.AtlasProject(
    name="asdf",
    modality="embedding",
    unique_id_field="id",
    reset_project_if_exists=True,
)

embeddings = np.random.rand(273, 384)
data = [{"id": "hello {i}" * 128} for i in range(273)]

project.add_embeddings(embeddings=embeddings, data=data)

Error uploading text

I was able to upload data to my project yesterday morning but in the afternoon I start getting this error.

File "C:\Python310\lib\site-packages\nomic\atlas.py", line 237, in map_text
project.add_text(
File "C:\Python310\lib\site-packages\nomic\project.py", line 1184, in add_text
data = pa.Table.from_pylist(data)
File "pyarrow\table.pxi", line 3906, in pyarrow.lib.Table.from_pylist
File "pyarrow\table.pxi", line 5453, in pyarrow.lib._from_pylist
File "pyarrow\table.pxi", line 3781, in pyarrow.lib.Table.from_arrays
File "pyarrow\table.pxi", line 1434, in pyarrow.lib._sanitize_arrays
File "pyarrow\table.pxi", line 1415, in pyarrow.lib._schema_from_arrays
File "pyarrow\array.pxi", line 327, in pyarrow.lib.array
File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object

I thought maybe a release or something but it's still happening today.

What changed?

`update_maps` shard_size is deprecated

update_maps currently passes a shard_size parameter. Within the method, update_maps creates a progress bar based on shard_size and assumes it is a numeric value. However, when this value then gets passed to add_embeddings or add_text, shard_size is treated as deprecated and will raise a warning.

Hence, due to this conflict, the method does not work. The task to fix this is to remove shard_size option from update_maps.

Why does project.create_index return an AtlasProjection and not an AtlasIndex?

Now able to find `detect_duplicate` in the documentation or as code

I am trying to dedup my dataset. The Atlas Duplicate Clustering section in the documentation has a line - "Make sure to enable duplicate clustering by setting detect_duplicate = True when building a map". I am not able to find this argument in the Atlas API references or in this GitHub repo.

The bottom line question is - can I dedup my dataset using Atlas?

Link OpenAI and Langchain docs from main docs.

So people can find all the relevant information.

AttributeError: 'AtlasProjection' object has no attribute 'is_locked'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[29], line 5
      3 data_points = {}
      4 project.wait_for_project_lock()
----> 5 for id, embed in map.get_embedding_iterator():
      6     data_points[id] = {'embedding': embed}
      8 topic_data = map.get_topic_data()

File ~/Minds/projects/topical/nomic/lib/python3.10/site-packages/nomic/project.py:614, in AtlasProjection.get_embedding_iterator(self)
    605 def get_embedding_iterator(self) -> Iterable[Tuple[str, str]]:
    606     '''
    607     Iterate through embeddings of your datums.
    608 
   (...)
    611 
    612     '''
--> 614     if self.is_locked:
    615         raise Exception('Project is locked! Please wait until the project is unlocked to download embeddings')
    617     offset = 0

AttributeError: 'AtlasProjection' object has no attribute 'is_locked'

KeyError in AtlasProjection.group_by_topic(3)

Source:

project = atlas.AtlasProject(project_id=PROJECT)
map = project.get_map()
topic_groups = map.group_by_topic(3)

Error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[24], line 3
      1 project = atlas.AtlasProject(project_id=PROJECT)
      2 map = project.get_map()
----> 3 topic_groups = map.group_by_topic(3)

File ~/Minds/projects/topical/nomic/lib/python3.10/site-packages/nomic/project.py:745, in AtlasProjection.group_by_topic(self, topic_depth)
    742 result_dict = {}
    743 topic_metadata = topic_df[topic_df["topic_short_description"] == topic]
--> 745 subtopics = hierarchy[topic]
    746 result_dict["subtopics"] = subtopics
    747 result_dict["subtopic_ids"] = topic_df[topic_df["topic_short_description"].isin(subtopics)]["topic_id"].tolist()

KeyError: 'Cloud and Server Hosting'

Allow Adding Arrow Tables/Batches Directly

In #214 we have to call to_pylist() on the batches, it probably would be faster to upload the batches directly

GPT4All platform support

We currently have a really stupid way of wrapping this into python that puppeteering binaries hosted on S3 through stdout. I am not proud of this. The next step is to get some real C++ object wrappers based on [https://github.com/ggerganov/llama.cpp] controlling the repo and some platform-specific build wheels.

Current status:

[x] Linux x86
[x] Mac M1/M2
[ ] Mac x86
[ ] Windows x86

Show nearest points in the original space

Tensorflow has a cool embedding visualizer here and I think some of their features are really cool and should be added to atlas.

For example, when you click on some datapoint X, Tensorflow's visualizer shows you which points are closest to datapoint X in the original space. This helps you to understand how the dimensionality reduction technique preserves and distorts distances.

I think this would be a great feature to add to atlas. Cheers!

ValueError: Underscore fields are reserved for Atlas internal use: __index_level_0__

but if we just run reset_index() before uploading it'll be fine.

nomic-ai / nomic Goto Github PK

nomic's People

Stargazers

Watchers

Forkers

nomic's Issues

mwe

Recommend Projects

Recommend Topics

Recommend Org