Nomic Atlas Python Client

Explore, label, search and share massive datasets in your web browser.

This repository contains Python bindings for working with Nomic Atlas, the world’s most powerful unstructured data interaction platform. Atlas supports datasets from hundreds to tens of millions of points and data modalities ranging from text to image to audio to video.

With Nomic Atlas, you can:

  • Generate, store and retrieve embeddings for your unstructured data.
  • Find insights in your unstructured data and embeddings all from your web browser.
  • Share and present your datasets and data findings to anyone.

Where to find us?

https://atlas.nomic.ai/

Atlas Map of Arxiv Data
Articles Submitted to Arxiv (10/12/2023 - 10/19/2023)
Atlas Map of TikTok Data
Historical TikTok Dataset (Indexed on Metadata Descriptions)

Quick Resources

Try the 📓 Colab Demo to get started in Python

Read the 📕 Atlas Docs

Join our 🛖 Discord to start chatting and get help

Example maps

🗺️ Map of Twitter (5.4 million tweets)

🗺️ Map of StableDiffusion Generations (6.4 million images)

🗺️ Map of NeurIPS Proceedings (16,623 abstracts)

Features

Here are just a few of the features which Atlas offers:

  • Organize your text, image, and embedding data
  • Create beautiful and shareable maps with or without coding knowledge
  • Have easy access to both high-level data structures and individual datapoints
  • Search millions of datapoints instantly
  • Cluster data into semantic topics
  • Tag and clean your dataset
  • Deduplicate text, images, video, and audio

Quickstart

Installation

  1. Install the Nomic library:
pip install nomic
  2. Log in or create your Nomic account:
nomic login
  3. Follow the instructions to obtain your access token, then pass it to the login command:
nomic login [token]

Make your first map

from nomic import atlas
import numpy as np

# Randomly generate a set of 10,000 high-dimensional embeddings
num_embeddings = 10000
embeddings = np.random.rand(num_embeddings, 256)

# Create Atlas project
dataset = atlas.map_data(embeddings=embeddings)

print(dataset)

Atlas usage examples

Access your embeddings

Atlas stores, manages and generates embeddings for your unstructured data.

You can access either the high-dimensional latent embeddings that Atlas stores or the two-dimensional projected embeddings generated for web display.

# Access your Atlas map and download your embeddings
map = dataset.maps[0]

projected_embeddings = map.embeddings.projected
latent_embeddings = map.embeddings.latent
print(projected_embeddings)
# Response:
id 	x 	y
0 	9.815330 	-8.105308
1 	-8.725819 	5.980628
2 	13.199472 	-1.103389
... 	... 	...
print(latent_embeddings)
# Response:
n x d numpy.ndarray where n = number of datapoints and d = number of latent dimensions
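
As a small illustration of working with the projected coordinates, the sketch below computes the extent of the map, for example to set axis limits in your own plot. It uses plain Python dictionaries whose values are copied from the sample response above as a stand-in for the projected embeddings; `map_extent` is a hypothetical helper, not part of the Nomic client.

```python
# Rows copied from the sample response above. In practice these values
# would come from the projected embeddings returned by Atlas.
projected_rows = [
    {"id": 0, "x": 9.815330, "y": -8.105308},
    {"id": 1, "x": -8.725819, "y": 5.980628},
    {"id": 2, "x": 13.199472, "y": -1.103389},
]


def map_extent(rows):
    """Return (min_x, max_x, min_y, max_y) over the projected points."""
    xs = [row["x"] for row in rows]
    ys = [row["y"] for row in rows]
    return min(xs), max(xs), min(ys), max(ys)


extent = map_extent(projected_rows)
print(extent)
```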

View your data’s topic model

Atlas automatically organizes your data into topics informed by the latent contents of your embeddings. Visually, these are represented by regions of homogeneous color on an Atlas map.

You can access and operate on topics programmatically by using the topics attribute of an AtlasMap.

# Access your Atlas map
map = dataset.maps[0]

# Access a pandas DataFrame associating each datum on your map with its topic at each topic depth.
topic_df = map.topics.df

print(topic_df)
Response:

id    topic_depth_1       topic_depth_2
0     Oil Prices          mergers and acquisitions
1     Iraq War            Trial of Thatcher
2     Oil Prices          Economic Growth
...   ...                 ...
9997  Oil Prices          Economic Growth
9998  Baseball            Giambi's contract
9999  Olympic Gold Medal  European Football
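
Once you have the topic table, a common next step is to summarize it. The sketch below counts how many data points fall under each top-level topic and collects the subtopics seen under each one. It uses only the standard library, and the rows are copied from the sample response above as a stand-in for the DataFrame; the variable names are illustrative and not part of the Nomic client.

```python
from collections import Counter, defaultdict

# (id, topic_depth_1, topic_depth_2) rows copied from the sample response above.
topic_rows = [
    (0, "Oil Prices", "mergers and acquisitions"),
    (1, "Iraq War", "Trial of Thatcher"),
    (2, "Oil Prices", "Economic Growth"),
    (9997, "Oil Prices", "Economic Growth"),
    (9998, "Baseball", "Giambi's contract"),
    (9999, "Olympic Gold Medal", "European Football"),
]

# Count data points per top-level topic.
depth_1_counts = Counter(depth_1 for _, depth_1, _ in topic_rows)

# Collect the distinct depth-2 topics observed under each depth-1 topic.
subtopics = defaultdict(set)
for _, depth_1, depth_2 in topic_rows:
    subtopics[depth_1].add(depth_2)

print(depth_1_counts.most_common(1))
print(sorted(subtopics["Oil Prices"]))
```

With the real DataFrame, the same count could likely be obtained with pandas' `value_counts()` on the `topic_depth_1` column.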

Search for data semantically

Use Atlas to automatically find nearest neighbors in your vector database.

# Load map and perform vector search for the five nearest neighbors of datum with id "my_query_point"
map = dataset.maps[0]

with dataset.wait_for_dataset_lock():
    neighbors, _ = map.embeddings.vector_search(ids=['my_query_point'], k=5)

# Return similar data points
similar_datapoints = dataset.get_data(ids=neighbors[0])

print(similar_datapoints)
Response:

Original query point:
"Intel abandons digital TV chip project NEW YORK, October 22 (newratings.com) - Global semiconductor giant Intel Corporation (INTC.NAS) has called off its plan to develop a new chip for the digital projection televisions."

Nearest neighbors:
"Intel awaits government move on expensing options Figuring it's had enough of fighting over options, the chip giant is waiting to see what Congress comes up with."
"Citigroup Takes On Intel The financial services giant takes over non-memory semiconductor chip production."
"Intel Seen Readying New Wi-Fi Chips  SAN FRANCISCO (Reuters) - Intel Corp. this week is  expected to introduce a chip that adds support for a relatively  obscure version of Wi-Fi, analysts said on Monday, in a move  that could help ease congestion on wireless networks."
"Intel pledges to bring Itanic down to Xeon price-point EM64T a stand-in until the real anti-AMD64 kit arrives"
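
To make the idea of nearest-neighbor search concrete, here is a minimal brute-force sketch in plain Python. It is a conceptual illustration only: the toy vectors, the identifiers, and the choice of cosine similarity are assumptions, and it does not describe how Atlas indexes or scores your data internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_id, vectors, k):
    """Return the ids of the k vectors most similar to vectors[query_id]."""
    query = vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != query_id  # exclude the query point itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other_id for _, other_id in scored[:k]]


# Hypothetical 3-dimensional embeddings keyed by datum id.
toy_vectors = {
    "chips_a": [0.9, 0.1, 0.0],
    "chips_b": [0.8, 0.2, 0.1],
    "sports": [0.0, 0.1, 0.9],
    "finance": [0.3, 0.9, 0.1],
}

neighbors = nearest_neighbors("chips_a", toy_vectors, k=2)
print(neighbors)
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of points but is the kind of work a hosted vector search is meant to take off your hands at the scale of millions of data points.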

Background

Atlas is developed by the Nomic AI team, which is based in NYC. Nomic also developed and maintains GPT4All, an open-source LLM chatbot ecosystem.

Discussion

Join the discussion on our 🛖 Discord to ask questions, get help, and chat with others about Atlas, Nomic, GPT4All, and related topics. Our doors are open to enthusiasts of all skill levels.
