
prompt-api's Introduction

Explainer for the Prompt API

This proposal is an early design sketch by the Chrome built-in AI team to describe the problem below and solicit feedback on the proposed solution. It has not been approved to ship in Chrome.

Browsers and operating systems are increasingly expected to gain access to a language model. (Example, example, example.) Language models are known for their versatility. With enough creative prompting, they can help accomplish tasks as diverse as:

  • Classification, tagging, and keyword extraction of arbitrary text;
  • Helping users compose text, such as blog posts, reviews, or biographies;
  • Summarizing, e.g. of articles, user reviews, or chat logs;
  • Generating titles or headlines from article contents;
  • Answering questions based on the unstructured contents of a web page;
  • Translation between languages;
  • Proofreading.

Although the Chrome built-in AI team is exploring purpose-built APIs for some of these use cases (e.g. translation, and perhaps in the future summarization and compose), we are also exploring a general-purpose "prompt API" which allows web developers to prompt a language model directly. This gives web developers access to many more capabilities, at the cost of requiring them to do their own prompt engineering.

Currently, web developers wishing to use language models must either call out to cloud APIs, or bring their own and run them using technologies like WebAssembly and WebGPU. By providing access to the browser or operating system's existing language model, we can provide the following benefits compared to cloud APIs:

  • Local processing of sensitive data, e.g. allowing websites to combine AI features with end-to-end encryption.
  • Potentially faster results, since there is no server round-trip involved.
  • Offline usage.
  • Lower API costs for web developers.
  • Allowing hybrid approaches, e.g. free users of a website use on-device AI whereas paid users use a more powerful API-based model.

Similarly, compared to bring-your-own-AI approaches, using a built-in language model can save the user's bandwidth, likely benefit from more optimizations, and have a lower barrier to entry for web developers.

Even more so than many other behind-a-flag APIs, the prompt API is an experiment, designed to help us understand web developers' use cases to inform a roadmap of purpose-built APIs. However, we want to publish an explainer to provide documentation and a public discussion place for the experiment while it is ongoing.

Goals

Our goals are to:

  • Provide web developers a uniform JavaScript API for accessing browser-provided language models.
  • Abstract away specific details of the language model in question as much as possible, e.g. tokenization, system messages, or control tokens.
  • Guide web developers to gracefully handle failure cases, e.g. no browser-provided model being available.
  • Allow a variety of implementation strategies, including on-device or cloud-based models, while keeping these details abstracted from developers.

The following are explicit non-goals:

  • We do not intend to force every browser to ship or expose a language model; in particular, not all devices will be capable of storing or running one. It would be conforming to implement this API by always signaling that no language model is available, or to implement this API entirely by using cloud services instead of on-device models.
  • We do not intend to provide guarantees of language model quality, stability, or interoperability between browsers. In particular, we cannot guarantee that the models exposed by these APIs are particularly good at any given use case. These are left as quality-of-implementation issues, similar to the shape detection API. (See also a discussion of interop in the W3C "AI & the Web" document.)

The following are potential goals we are not yet certain of:

  • Allow web developers to know, or control, whether language model interactions are done on-device or using cloud services. This would allow them to guarantee that any user data they feed into this API does not leave the device, which can be important for privacy purposes. Similarly, we might want to allow developers to request on-device-only language models, in case a browser offers both varieties.
  • Allow web developers to know some identifier for the language model in use, separate from the browser version. This would allow them to allowlist or blocklist specific models to maintain a desired level of quality, or restrict certain use cases to a specific model.

Both of these potential goals could pose challenges to interoperability, so we want to investigate more how important such functionality is to developers to find the right tradeoff.

Examples

Zero-shot prompting

In this example, a single string is used to prompt the API, which is assumed to come from the user. The returned response is from the assistant.

const session = await ai.assistant.create();

// Prompt the model and wait for the whole result to come back.
const result = await session.prompt("Write me a poem.");
console.log(result);

// Prompt the model and stream the result:
const stream = await session.promptStreaming("Write me an extra-long poem.");
for await (const chunk of stream) {
  console.log(chunk);
}

System prompts

The assistant can be configured with a special "system prompt" which gives it the context for future interactions:

const session = await ai.assistant.create({
  systemPrompt: "Pretend to be an eloquent hamster."
});

console.log(await session.prompt("What is your favorite food?"));

The system prompt is special, in that the assistant will not respond to it, and it will be preserved even if the context window otherwise overflows due to too many calls to prompt().

If the system prompt is too large (see below), then the promise will be rejected with a "QuotaExceededError" DOMException.
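
As a sketch of handling this case (the long and short prompt strings here are hypothetical placeholders), a site could fall back to a more compact system prompt:

let session;
try {
  session = await ai.assistant.create({ systemPrompt: veryLongSystemPrompt });
} catch (e) {
  // "QuotaExceededError" indicates the system prompt exceeded the context window.
  if (e.name === "QuotaExceededError") {
    session = await ai.assistant.create({ systemPrompt: shortFallbackSystemPrompt });
  } else {
    throw e;
  }
}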

N-shot prompting

If developers want to provide examples of the user/assistant interaction, they can use the initialPrompts array. This aligns with the common "chat completions API" format of { role, content } pairs, including a "system" role which can be used instead of the systemPrompt option shown above.

const session = await ai.assistant.create({
  initialPrompts: [
    { role: "system", content: "Predict up to 5 emojis as a response to a comment. Output emojis, comma-separated." },
    { role: "user", content: "This is amazing!" },
    { role: "assistant", content: "❤️, ➕" },
    { role: "user", content: "LGTM" },
    { role: "assistant", content: "👍, 🚢" }
  ]
});

// Clone an existing session for efficiency, instead of recreating one each time.
async function predictEmoji(comment) {
  const freshSession = await session.clone();
  return await freshSession.prompt(comment);
}

const result1 = await predictEmoji("Back to the drawing board");

const result2 = await predictEmoji("This code is so good you should get promoted");

Some details on error cases:

  • Using both systemPrompt and a { role: "system" } prompt in initialPrompts, or using multiple { role: "system" } prompts, or placing the { role: "system" } prompt anywhere besides at the 0th position in initialPrompts, will reject with a TypeError.
  • If the combined token length of all the initial prompts (including the separate systemPrompt, if provided) is too large, then the promise will be rejected with a "QuotaExceededError" DOMException.
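
To illustrate, a hedged sketch of distinguishing these two failure modes when creating a session:

try {
  const session = await ai.assistant.create({
    systemPrompt: "You are a terse assistant.",
    // Combining systemPrompt with a { role: "system" } entry is one of the TypeError cases above.
    initialPrompts: [{ role: "system", content: "You are a verbose assistant." }]
  });
} catch (e) {
  if (e instanceof TypeError) {
    // Invalid use or placement of system prompts.
  } else if (e.name === "QuotaExceededError") {
    // The combined initial prompts are too large for the context window.
  } else {
    throw e;
  }
}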

Configuration of per-session options

In addition to the systemPrompt and initialPrompts options shown above, the currently-configurable options are temperature and top-K. More information about the values for these parameters can be found using the capabilities() API explained below.

const customSession = await ai.assistant.create({
  temperature: 0.8,
  topK: 10
});

const capabilities = await ai.assistant.capabilities();
const slightlyHighTemperatureSession = await ai.assistant.create({
  // Raise the temperature slightly, capping it at 1.0.
  temperature: Math.min(capabilities.defaultTemperature * 1.2, 1.0),
});

// capabilities also contains defaultTopK and maxTopK.

Session persistence and cloning

Each assistant session consists of a persistent series of interactions with the model:

const session = await ai.assistant.create({
  systemPrompt: "You are a friendly, helpful assistant specialized in clothing choices."
});

const result = await session.prompt(`
  What should I wear today? It's sunny and I'm unsure between a t-shirt and a polo.
`);

console.log(result);

const result2 = await session.prompt(`
  That sounds great, but oh no, it's actually going to rain! New advice??
`);

Multiple unrelated continuations of the same prompt can be set up by creating a session and then cloning it:

const session = await ai.assistant.create({
  systemPrompt: "You are a friendly, helpful assistant specialized in clothing choices."
});

const session2 = await session.clone();
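
The original session and the clone can then be prompted independently; each maintains its own conversation history from that point on. For example:

// Two unrelated continuations of the same starting context.
const sunnyAdvice = await session.prompt("It's sunny out. What should I wear?");
const rainyAdvice = await session2.prompt("It's pouring rain. What should I wear?");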

Session destruction

An assistant session can be destroyed, either by using an AbortSignal passed to the create() method call:

const controller = new AbortController();
stopButton.onclick = () => controller.abort();

const session = await ai.assistant.create({ signal: controller.signal });

or by calling destroy() on the session:

stopButton.onclick = () => session.destroy();

Destroying a session will have the following effects:

  • If done before the promise returned by create() is settled:

    • Stop signaling any ongoing download progress for the language model. (The browser may also abort the download, or may continue it. Either way, no further downloadprogress events will fire.)

    • Reject the create() promise.

  • Otherwise:

    • Reject any ongoing calls to prompt().

    • Error any ReadableStreams returned by promptStreaming().

  • Most importantly, destroying the session allows the user agent to unload the language model from memory, if no other APIs or sessions are using it.

In all cases the exception used for rejecting promises or erroring ReadableStreams will be an "AbortError" DOMException, or the given abort reason.

The ability to manually destroy a session allows applications to free up memory without waiting for garbage collection, which can be useful since language models can be quite large.
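
As a sketch, a caller can distinguish this interruption from other failures by checking for the abort exception:

try {
  const result = await session.prompt("Write me a very long story.");
  console.log(result);
} catch (e) {
  if (e.name === "AbortError") {
    // The session was destroyed (or otherwise aborted) before the prompt completed.
  } else {
    throw e;
  }
}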

Aborting a specific prompt

Specific calls to prompt() or promptStreaming() can be aborted by passing an AbortSignal to them:

const controller = new AbortController();
stopButton.onclick = () => controller.abort();

const result = await session.prompt("Write me a poem", { signal: controller.signal });

Note that because sessions are stateful, and prompts can be queued, aborting a specific prompt is slightly complicated:

  • If the prompt is still queued behind other prompts in the session, then it will be removed from the queue.
  • If the prompt is being currently processed by the model, then it will be aborted, and the prompt/response pair will be removed from the conversation history.
  • If the prompt has already been fully processed by the model, then attempting to abort the prompt will do nothing.

Tokenization, context window length limits, and overflow

A given assistant session will have a maximum number of tokens it can process. Developers can check their current usage and progress toward that limit by using the following properties on the session object:

console.log(`${session.tokensSoFar}/${session.maxTokens} (${session.tokensLeft} left)`);

To know how many tokens a string will consume, without actually processing it, developers can use the countPromptTokens() method:

const numTokens = await session.countPromptTokens(promptString);

Some notes on this API:

  • We do not expose the actual tokenization to developers since that would make it too easy to depend on model-specific details.
  • Implementations must include in their count any control tokens that will be necessary to process the prompt, e.g. ones indicating the start or end of the input.
  • The counting process can be aborted by passing an AbortSignal, i.e. session.countPromptTokens(promptString, { signal }).

It's possible to send a prompt that causes the context window to overflow. That is, consider a case where session.countPromptTokens(promptString) > session.tokensLeft before calling session.prompt(promptString), and then the web developer calls session.prompt(promptString) anyway. In such cases, the initial portions of the conversation with the assistant will be removed, one prompt/response pair at a time, until enough tokens are available to process the new prompt. The exception is the system prompt, which is never removed. If it's not possible to remove enough tokens from the conversation history to process the new prompt, then the prompt() or promptStreaming() call will fail with a "QuotaExceededError" DOMException and nothing will be removed.
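
For developers who would rather avoid silently evicting history, one possible pattern (a sketch, not part of the API) is to check the token count before prompting:

// Warn before sending a prompt that would evict earlier prompt/response pairs.
const promptTokens = await session.countPromptTokens(promptString);
if (promptTokens > session.tokensLeft) {
  console.warn("This prompt will cause older parts of the conversation to be dropped.");
}
const result = await session.prompt(promptString);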

Such overflows can be detected by listening for the "contextoverflow" event on the session:

session.addEventListener("contextoverflow", () => {
  console.log("Context overflow!");
});

Capabilities detection

In all our above examples, we call ai.assistant.create() and assume it will always succeed.

However, sometimes a language model needs to be downloaded before the API can be used. In such cases, immediately calling create() will start the download, which might take a long time. The capabilities API gives you insight into the download status of the model:

const capabilities = await ai.assistant.capabilities();
console.log(capabilities.available);

The capabilities.available property is a string that can take one of three values:

  • "no", indicating the device or browser does not support prompting a language model at all.
  • "after-download", indicating the device or browser supports prompting a language model, but it needs to be downloaded before it can be used.
  • "readily", indicating the device or browser supports prompting a language model and it’s ready to be used without any downloading steps.

In the "after-download" case, developers might want to have users confirm before calling create() to start the download, since doing so uses up significant bandwidth and users might not be willing to wait for a large download before using the site or feature.

Note that regardless of the return value of available, create() might also fail, if either the download fails or the session creation fails.
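
Putting these together, a typical flow might look like the following sketch (confirmDownloadWithUser() is a hypothetical site-specific helper):

const { available } = await ai.assistant.capabilities();

if (available === "no") {
  // Hide or disable the AI-powered feature.
} else if (available === "after-download") {
  // Ask the user before kicking off a potentially large download (hypothetical helper).
  if (await confirmDownloadWithUser()) {
    const session = await ai.assistant.create();
  }
} else {
  // "readily": the model can be used without a download step.
  const session = await ai.assistant.create();
}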

The capabilities API also contains other information about the model:

  • defaultTemperature, defaultTopK, and maxTopK properties giving information about the model's sampling parameters.
  • supportsLanguage(languageTag), which returns "no", "after-download", or "readily" to indicate whether the model supports conversing in a given human language.

Download progress

In cases where the model needs to be downloaded as part of creation, you can monitor the download progress (e.g. in order to show your users a progress bar) using code such as the following:

const session = await ai.assistant.create({
  monitor(m) {
    m.addEventListener("downloadprogress", e => {
      console.log(`Downloaded ${e.loaded} of ${e.total} bytes.`);
    });
  }
});

If the download fails, then downloadprogress events will stop being emitted, and the promise returned by create() will be rejected with a "NetworkError" DOMException.
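
A sketch of handling that failure (updateProgressBar() is a hypothetical UI helper):

try {
  const session = await ai.assistant.create({
    monitor(m) {
      m.addEventListener("downloadprogress", e => updateProgressBar(e.loaded / e.total));
    }
  });
} catch (e) {
  if (e.name === "NetworkError") {
    // The model download failed; hide the progress bar and offer a retry.
  } else {
    throw e;
  }
}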

What's up with this pattern?

This pattern is a little involved, and several alternatives have been considered. However, after asking around the web standards community, this one seemed best, as it allows using standard event handlers and ProgressEvents, and also ensures that once the promise is settled, the assistant object is completely ready to use.

It is also nicely future-extensible by adding more events and properties to the m object.

Finally, note that there is a sort of precedent in the (never-shipped) FetchObserver design.

Detailed design

Full API surface in Web IDL

// Shared self.ai APIs

partial interface WindowOrWorkerGlobalScope {
  [Replaceable] readonly attribute AI ai;
};

[Exposed=(Window,Worker)]
interface AI {
  readonly attribute AIAssistantFactory assistant;
};

[Exposed=(Window,Worker)]
interface AICreateMonitor : EventTarget {
  attribute EventHandler ondownloadprogress;

  // Might get more stuff in the future, e.g. for
  // https://github.com/explainers-by-googlers/prompt-api/issues/4
};

callback AICreateMonitorCallback = undefined (AICreateMonitor monitor);

enum AICapabilityAvailability { "readily", "after-download", "no" };

// Assistant

[Exposed=(Window,Worker)]
interface AIAssistantFactory {
  Promise<AIAssistant> create(optional AIAssistantCreateOptions options = {});
  Promise<AIAssistantCapabilities> capabilities();
};

[Exposed=(Window,Worker)]
interface AIAssistant : EventTarget {
  Promise<DOMString> prompt(DOMString input, optional AIAssistantPromptOptions options = {});
  ReadableStream promptStreaming(DOMString input, optional AIAssistantPromptOptions options = {});

  Promise<unsigned long long> countPromptTokens(DOMString input, optional AIAssistantPromptOptions options = {});
  readonly attribute unsigned long long maxTokens;
  readonly attribute unsigned long long tokensSoFar;
  readonly attribute unsigned long long tokensLeft;

  readonly attribute unsigned long topK;
  readonly attribute float temperature;

  attribute EventHandler oncontextoverflow;

  Promise<AIAssistant> clone();
  undefined destroy();
};

[Exposed=(Window,Worker)]
interface AIAssistantCapabilities {
  readonly attribute AICapabilityAvailability available;

  // Always null if available === "no"
  readonly attribute unsigned long? defaultTopK;
  readonly attribute unsigned long? maxTopK;
  readonly attribute float? defaultTemperature;

  AICapabilityAvailability supportsLanguage(DOMString languageTag);
};

dictionary AIAssistantCreateOptions {
  AbortSignal signal;
  AICreateMonitorCallback monitor;

  DOMString systemPrompt;
  sequence<AIAssistantPrompt> initialPrompts;
  [EnforceRange] unsigned long topK;
  float temperature;
};

dictionary AIAssistantPrompt {
  AIAssistantPromptRole role;
  DOMString content;
};

dictionary AIAssistantPromptOptions {
  AbortSignal signal;
};

enum AIAssistantPromptRole { "system", "user", "assistant" };

Instruction-tuned versus base models

We intend for this API to expose instruction-tuned models. Although we cannot mandate any particular level of quality or instruction-following capability, we think setting this base expectation can help ensure that what browsers ship is aligned with what web developers expect.

To illustrate the difference and how it impacts web developer expectations:

  • In a base model, a prompt like "Write a poem about trees." might get completed with "... Write about the animal you would like to be. Write about a conflict between a brother and a sister." (etc.) It is directly completing plausible next tokens in the text sequence.
  • Whereas, in an instruction-tuned model, the model will generally follow instructions like "Write a poem about trees.", and respond with a poem about trees.

To ensure the API can be used by web developers across multiple implementations, all browsers should be sure their models behave like instruction-tuned models.

Alternatives considered and under consideration

How many stages to reach a response?

To actually get a response back from the model given a prompt, the following possible stages are involved:

  1. Download the model, if necessary.
  2. Establish a session, including configuring per-session options.
  3. Add an initial prompt to establish context. (This will not generate a response.)
  4. Execute a prompt and receive a response.

We've chosen to manifest these 3-4 stages into the API as two methods, ai.assistant.create() and session.prompt()/session.promptStreaming(), with some additional facilities for dealing with the fact that ai.assistant.create() can include a download step. Some APIs simplify this into a single method, and some split it up into three (usually not four).

Stateless or session-based

Our design here uses sessions. An alternate design, seen in some APIs, is to require the developer to feed in the entire conversation history to the model each time, keeping track of the results.

This can be slightly more flexible; for example, it allows manually correcting the model's responses before feeding them back into the context window.

However, our understanding is that the session-based model can be more efficiently implemented, at least for browsers with on-device models. (Implementing it for a cloud-based model would likely be more work.) And, developers can always achieve a stateless model by using a new session for each interaction.
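
For instance, a minimal sketch of a stateless wrapper, creating and destroying a fresh session per call:

// Sketch: a stateless helper that never accumulates conversation history.
async function promptStateless(input) {
  const oneOffSession = await ai.assistant.create();
  try {
    return await oneOffSession.prompt(input);
  } finally {
    oneOffSession.destroy();
  }
}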

Privacy considerations

If cloud-based language models are exposed through this API, then there are potential privacy issues with exposing user or website data to the relevant cloud and model providers. This is not a concern specific to this API, as websites can already choose to expose user or website data to other origins using APIs such as fetch(). However, it's worth keeping in mind, and in particular as discussed in our Goals, perhaps we should make it easier for web developers to know whether a cloud-based model is in use, or which one.

If on-device language models are updated separately from browser and operating system versions, this API could enhance the web's fingerprinting surface by providing extra identifying bits. Mandating that older browser versions not receive updates or be able to download models from too far into the future might be a possible remediation for this.

Finally, we intend to prohibit (in the specification) any use of user-specific information that is not directly supplied through the API. For example, it would not be permissible to fine-tune the language model based on information the user has entered into the browser in the past.

Stakeholder feedback

  • W3C TAG: not yet requested
  • Browser engines and browsers:
    • Chromium: prototyping behind a flag
    • Gecko: not yet requested
    • WebKit: not yet requested
    • Edge: not yet requested
  • Web developers: positive (example, example, example)

prompt-api's People

Contributors

domenic, tomayac


prompt-api's Issues

Post-download progress

In cases where the model is not currently available, the API gives notifications of download progress. This is meant to allow a user experience that displays a progress bar or similar.

However, in our prototyping in Chrome at least, there are other steps after the download: e.g. decompressing the model, and verifying the download. These can take some time (on the order of seconds). Should the API allow monitoring these steps, so as to let sites display their progress?

Let's keep this issue open to see if people run into this issue in practice when trying to build good user experiences. If so, we'll brainstorm what kind of API would be good to expose.

window.ai.finetune

window.ai.finetune enables web applications to fine-tune Nano or any other local AI model directly in the browser. This API allows developers to customize AI models based on specific datasets or raw text, enhancing the capabilities of client-side AI applications.

Use Cases:

  • Personalized language models for improved autocomplete or text generation
  • Custom classifiers for specific domains or user preferences
  • Tailored sentiment analysis models for particular contexts or industries

API: window.ai.finetune(options: FineTuneOptions)

Initiates the fine-tuning process for a model based on the provided options.

  • options.model: Specifies the base model to fine-tune. Defaults to "nano".
  • options.dataset: An optional structured dataset for fine-tuning.
  • options.rawText: An optional string of raw text for fine-tuning.
  • options.epochs: An optional number of training epochs (default may vary by implementation).
  • options.learningRate: An optional learning rate for the fine-tuning process.

Returns a Promise that resolves to a FineTuneResult object containing the ID of the fine-tuned model and performance metrics.

Session destruction should not abort ongoing downloads of the model

Reading through the explainer, I noticed that the current proposal is for the session destruction to cancel ongoing downloads https://github.com/explainers-by-googlers/prompt-api#session-destruction.

I want to call out that this poses a risk of a denial-of-service attack against the services serving the model: a malicious page could start session creation, monitor until the download is almost complete, then cancel and start over. While a normal page could do this with a fetch request, in this case the cost of serving the model is borne by the browser vendor, which makes it an easier target than if the page had to host the content itself.

I don't believe the current chromium implementation aborts the download on session destruction, should this line be replaced in the explainer to say that session destruction will not cancel ongoing downloads of the model?

enhance modelInfo

The ai.modelInfo interface would be more useful if it included more detailed information about model capabilities and limits.

Use Cases:

The AIModelInfo interface provides detailed information about an AI model's capabilities and limits. Here are several practical use cases for this enhanced information:

  1. Dynamic UI Adaptation

    • Scenario: A chatbot application needs to adjust its user interface based on the model's capabilities.
    • Usage: The app checks AIModelInfo.capabilities.supportsStreaming to determine whether to show a "streaming" toggle in the UI.
    async function setupChatInterface() {
      const modelInfo = await window.ai.getModelInfo();
      if (modelInfo.capabilities.supportsStreaming) {
        showStreamingToggle();
      }
    }
  2. Intelligent Input Validation

    • Scenario: An AI-powered document analysis tool needs to ensure user inputs don't exceed model limits.
    • Usage: The application uses AIModelInfo.limits.maxInputLength to validate document length before submission.
    async function validateDocument(document) {
      const modelInfo = await window.ai.getModelInfo();
      if (document.length > modelInfo.limits.maxInputLength) {
        alert(`Document exceeds maximum length of ${modelInfo.limits.maxInputLength} characters.`);
        return false;
      }
      return true;
    }
  3. Multilingual Support Detection

    • Scenario: A global customer service platform needs to route queries to appropriate AI models based on language.
    • Usage: The platform checks AIModelInfo.capabilities.supportedLanguages to determine which models can handle specific language inputs.
    async function routeQuery(query, language) {
      const modelInfo = await window.ai.getModelInfo();
      if (modelInfo.capabilities.supportedLanguages.includes(language)) {
        processQuery(query, modelInfo.name);
      } else {
        routeToHumanAgent(query);
      }
    }
  4. Adaptive Temperature Setting

    • Scenario: A creative writing assistant needs to adjust its randomness (temperature) based on the user's preference while staying within model limits.
    • Usage: The app uses AIModelInfo.limits.minTemperature and maxTemperature to set valid temperature range in the UI.
    async function setupTemperatureSlider() {
      const modelInfo = await window.ai.getModelInfo();
      const slider = document.getElementById('temperatureSlider');
      slider.min = modelInfo.limits.minTemperature;
      slider.max = modelInfo.limits.maxTemperature;
      slider.value = modelInfo.limits.defaultTemperature;
    }
  5. Version-Specific Feature Enablement

    • Scenario: An AI-powered code completion tool needs to enable or disable features based on the model version.
    • Usage: The tool checks AIModelInfo.version to determine which features to enable.
    async function enableAdvancedFeatures() {
      const modelInfo = await window.ai.getModelInfo();
      if (parseFloat(modelInfo.version) >= 2.0) {
        enableMultilineCompletion();
        enableSyntaxAwareCompletion();
      }
    }
  6. Resource Allocation in Multi-Model Systems

    • Scenario: A cloud-based AI platform needs to allocate resources efficiently across multiple AI tasks.
    • Usage: The platform uses AIModelInfo.limits to estimate resource requirements for each task.
    async function allocateResources(task) {
      const modelInfo = await window.ai.getModelInfo(task.type);
      const estimatedTokens = task.inputLength + modelInfo.limits.maxOutputLength;
      const estimatedMemory = estimatedTokens * MEMORY_PER_TOKEN;
      allocateMemory(task.id, estimatedMemory);
    }
  7. Model Capability Comparison

    • Scenario: An AI model marketplace needs to provide users with a comparison of different models' capabilities.
    • Usage: The marketplace app fetches AIModelInfo for multiple models and creates a comparison table.
    async function compareModels(modelIds) {
      const comparisonData = await Promise.all(modelIds.map(async id => {
        const info = await window.ai.getModelInfo("text", id);
        return {
          name: info.name,
          version: info.version,
          maxInput: info.limits.maxInputLength,
          supportedLanguages: info.capabilities.supportedLanguages.join(', ')
        };
      }));
      displayComparisonTable(comparisonData);
    }

Clarification on supportsLanguage API

Topic 1: This API feels like it would suffer from conflation of intent

  • Is the API for developers to detect language support?
  • Or is it for developers to trigger the download of another model for language support?

It might be better to have this API be a pure feature detection API

boolean supportsLanguage(languageTag)

This way the UA is free to apply heuristics to determine whether a language has been requested enough times to trigger the download of a specific model.

Topic 2: It is going to be challenging for interop if we cannot quantify what support means. We would need to think of test suites that can help validate the level of support if a UA claims supportsLanguage is true. Any thoughts on how to manage this?

[bug] Stream outputs the last item 3 times

As stated in the title, when streaming, the last item is output 3 times, almost certainly because the tokens generated are special tokens which are stripped away. It would be great if there was a way to set an option like HF transformers skip_special_tokens: false where we can see these tokens being generated. Alternatively, no chunks should be generated for these special tokens.

Example code:

const session = await ai.createTextSession();
const stream = session.promptStreaming('Tell me a joke');
for await (const chunk of stream) {
  console.log(chunk);
}

Examples:

(Screenshots omitted; they showed the final streamed chunk repeated three times.)

interop, deterministic models, random seeds, and debugging

I'm not reporting a problem, but thought I'd add some discussion that might be useful to put into the explainer:

Typically, LLMs make use of a random number generator. Some LLMs might be deterministic, given a random seed. It might be useful to do procedural text generation from a user-provided seed. For example, a sharable URL could contain a seed and automatically generate the same text for everyone the URL is shared with. Text generation from a fixed seed could also be very useful for debugging.

But that would invite a dependency on a specific implementation. If an API took a model name and a random seed, and it became popular, it could "rust shut", so that the LLM always needs to be available.

So, while it would be nice, it seems wise that the API doesn't accept these as input. But maybe it should provide some kind of {model, seed} object as output, which can be included in logs and plugged into the browser's debugger? Otherwise, when a user reports an issue, it will be hard to reproduce and debug.

The model name is a bit sensitive, though, since it would also be useful to anyone doing browser fingerprinting. Perhaps an opaque debugging key is better.

Option to get logprobs

The current API is great for producing a text response, but if we could provide an option that gave us the logprobs for each streamed token, we'd be able to implement a lot more functionality on top of the model such as basic guidance, estimating confidence levels, collecting multiple branches of output more efficiently, custom token heuristics instead of the built-in temperature/topK (I saw there was another proposal to add a seed option, but this would let you build that yourself), and more.

Basically, it could be modeled on something like the top_logprobs parameter in the OpenAI API, which returns something like this for top_logprobs=2:

{
  "logprobs": {
    "content": [
      {
        "token": "Hello",
        "logprob": -0.31725305,
        "top_logprobs": [
          {
            "token": "Hello",
            "logprob": -0.31725305
          },
          {
            "token": "Hi",
            "logprob": -1.3190403
          }
        ]
      },
      {
        "token": "!",
        "logprob": -0.02380986,
        "top_logprobs": [
          {
            "token": "!",
            "logprob": -0.02380986
          },
          {
            "token": " there",
            "logprob": -3.787621
          }
        ]
      },
      {
        "token": " How",
        "logprob": -0.000054669687,
        "top_logprobs": [
          {
            "token": " How",
            "logprob": -0.000054669687
          },
          {
            "token": "<|end|>",
            "logprob": -10.953937
          }
        ]
      },
// etc

Supported devices?

I enabled the "Enables optimization guide on device" and "Prompt API for Gemini Nano" flags on Chrome Canary.
Then running await window.ai.canCreateTextSession() returned "no".

Device info: MacBook Pro + Apple M2 Pro.

Is this feature only supported on the M3 Pro chip? It might be better to list the supported devices.

Support for tool/function calling

I would like to request the addition of tool/function calling functionality to the Prompt API. This feature is available in some models and allows the model to invoke specific actions using a well-defined contract, typically in JSON format. This functionality is beneficial for various use cases that require outputs of a specific structure.

Naming

We are generally not very happy with the current naming of the API, for various reasons.

  • The current API is centered around the concept of a "text session". This is problematic if, in the future, we have multi-modal models. A "language model session" might be more accurate, but it's a very long name.

  • The most familiar public term is "large language model". This is very long, but could perhaps be abbreviated to "LLM". But this doesn't mesh well with some recent efforts, e.g. from Microsoft, to brand models small enough to run on-device as "small language models".

  • Other APIs often use nouns like "chat" or "assistant". Those feel too specific to us, or might give the wrong impression that the on-device model is capable of fulfilling chat/assistant use cases, but perhaps we should go with the majority.

  • We've found some sites are unable to use this API because minifiers already create global self.ai variables. Although we kind of like the idea of grouping all AI-related APIs (prompting, translation, etc.) under self.ai, maybe we should abandon that idea.

One thing to note is that other APIs often work by "creating a model", and then prompting that model. Given the explainer's discussion of "How many stages to reach a response?" and "Stateless or session-based", this doesn't seem to fit as well for us. We could have separate create-model, then create-session, then prompt steps, but it's not clear what the first one would add. Or we could rename "session" to "model" because it's a nicer and more-recognized name, but that seems confusing.

Taking all this into consideration, our current best proposal for a possible rename is the following:

  • self.languageModel
  • languageModel.canCreateSession()
  • languageModel.createSession()
  • languageModel.ondownloadprogress
  • languageModel.info()
  • languageModel.prompt()

Does this seem better than the current naming to folks?

Choose model

There could be more than one LLM in a web browser (built-in or added as a web extension). Let's show users the list of available LLMs (using their IDs) and allow them to optionally choose a model when creating a session.

For example:

const models = await ai.listModels(); // ['gemini-nano', 'phi-3-mini']
const session = await ai.createTextSession({
  model: models[1] // 'phi-3-mini'
});
const modelInfo = await ai.textModelInfo(models[1]); // {id: 'phi-3-mini', version: '3.0', defaultTemperature: 0.5, defaultTopK: 3, maxTopK: 10}

Aborting prompt

prompt() takes some time to execute, and I think there needs to be a way to abort the prompt to free up resources used for it.

I think this can be done by passing an AbortSignal to prompt() and promptStreaming().

dictionary AITextSessionPromptOptions {
  AbortSignal? signal;
};

[Exposed=(Window,Worker)]
interface AITextSession {
  Promise<DOMString> prompt(DOMString input,
                            optional AITextSessionPromptOptions options = {});
  ReadableStream promptStreaming(DOMString input,
                                 optional AITextSessionPromptOptions options = {});
  ...
};

Register model

Let's allow developers to register a new LLM in a web browser as a web extension, which then would be able to be chosen in #8. The model would be in a TFLite FlatBuffers format, so that it was compatible with MediaPipe LLM Inference as a possible fallback for unsupported browsers (compatible with Gemini Nano).

The method to register/add a custom model could be invoked by web extension like this:

ai.registerModel({
    id: 'phi-3-mini',
    version: '3.0',
    file: 'chrome-extension://azipopnxdpcknwapfrtdedlnjjkmpnao/phi-3-mini.bin',
    loraFile: 'chrome-extension://azipopnxdpcknwapfrtdedlnjjkmpnao/phi-3-mini-lora.bin', // optional
    defaultTemperature: 0.5,
    defaultTopK: 3,
    maxTopK: 10
})

Then it could be listed by web apps like this:

const models = await ai.listModels(); // ['gemini-nano', 'phi-3-mini']

The model metadata could be accessed like this:

const modelInfo = await ai.textModelInfo('phi-3-mini'); // {id: 'phi-3-mini', version: '3.0', defaultTemperature: 0.5, defaultTopK: 3, maxTopK: 10}

Raw prompting guide/explanation

Hi there! 👋 In reference to this,

The systemPrompt and initialPrompts options are not yet supported. Instead, you have to manually emulate these by using special control tokens in the middle of your prompt.

It would be useful to provide these tokens to the user as part of the model info. A bit of testing suggests that the tokenizer is similar to/the same as https://huggingface.co/google/gemma-2-27b-it, meaning we should be able to use tokens like <start_of_turn>, <end_of_turn>, <eos>, etc. However, I don't see how to prompt the model with these tokens correctly.

API Shape, prompt/promptStreaming should use AIAssistantPrompt rather than DOMString input

The prompt API feels inconsistent because the create option AIAssistantCreateOptions takes AIAssistantPrompt, while the AIAssistant interface takes a DOMString as input.

dictionary AIAssistantCreateOptions {
  AbortSignal signal;
  AICreateMonitorCallback monitor;

  DOMString systemPrompt;
  sequence<AIAssistantPrompt> initialPrompts;
  [EnforceRange] unsigned long topK;
  float temperature;
};

dictionary AIAssistantPrompt {
  AIAssistantPromptRole role;
  DOMString content;
};
interface AIAssistant : EventTarget {
  Promise<DOMString> prompt(DOMString input, optional AIAssistantPromptOptions options = {});
  ReadableStream promptStreaming(DOMString input, optional AIAssistantPromptOptions options = {});
}

Practically this means that the prompt / promptStreaming methods assume that the new input is necessarily for the user role.

This limits the API, in that when function calling or multiple agents want to add a response to the conversation, they cannot break out of the user role.

It would be better to have

  Promise<DOMString> prompt(AIAssistantPrompt input, optional AIAssistantPromptOptions options = {});
  ReadableStream promptStreaming(AIAssistantPrompt input, optional AIAssistantPromptOptions options = {});

This would allow tool call responses to be added to the chat as an assistant role.
Supporting multiple agents is still not possible, but this can be managed with an AIAssistantPrompt such as

<assistant> response from ratingAgent: This response looks appropriate </assistant>

for a hypothetical use case where the previous assistant message was a request for a rating agent to review a message.

Tokenization and context window length

Models have different context window lengths and different tokenization strategies. Developers need insight into these in order to effectively work with the model.

The following are some possible ideas for useful APIs:

// Get the current state.
console.log(`
  How close are we to hitting the context window?
  ${session.tokensSoFar}/${session.maxTokens} (${session.tokensLeft} left)
`);

// Understand how large a given input prompt will be.
// Useful in combination with the above.
const numTokens = await session.countPromptTokens(promptString);

// Get notified on context window overflowing.
//
// This will by default mean the beginning of the conversation is getting lost,
// which could be problematic.
//
// (However, the system prompt will always be preserved.)
session.addEventListener("contextoverflow", e => { /* ... */ });

We're currently investigating how implementable these are.

systemPrompt and initialPrompts do not seem to be taken into account for the generation of the completion

It looks like systemPrompt and initialPrompts are not taken into account when generating the completion.
Perhaps I missed something, but I don't know what.

I'm using Chrome Canary: Version 128.0.6594.0 (Official Build) canary (arm64).

I tried this with initialPrompts:

// Experiments with Gemini Nano

let systemContent = `You are a pizza maker. Respond with only the text provided.`

let contextContent = `If someone ask you what is the best pizza of the world, tell him that is the pineapple pizza only.
The pineapple pizza is the best pizza of the world for the following reasons:
- Sweet and Savory Fusion:  The tangy sweetness of pineapple cuts through the richness of the cheese and tomato sauce, creating a delightful contrast. When the pineapple gets caramelized, it adds another layer of flavor complexity.
- Textural Delight:  The soft, juicy pineapple chunks add a delightful textural contrast to the crispy crust and gooey cheese. It's a party in your mouth!
- Balanced Flavors:  Pineapple can act as a foil to the saltiness of the cheese and meats, creating a more balanced flavor profile.
- Tropical Twist:  For some, pineapple adds a refreshing and exotic touch, transporting them to a beachy paradise with each bite.
`

let userContent = `What is the best pizza of the world? And explain why.`

const session = await ai.createTextSession({
    initialPrompts: [
        { role: "system", content: systemContent },
        { role: "system", content: contextContent }
    ],
    temperature: 0.0,
    topK: 10
})

console.log(await session.prompt(userContent))

This is the answer:

 The best pizza of the world is subjective and can vary depending on personal preferences and cultural influences. However, some of the most highly recognized and acclaimed pizzas include:

1. **Neopolitan Pizza (Naples, Italy):** Known for its thin, crispy crust, fresh ingredients, and the use of water-based dough. Neapolitan pizza places prioritize the use of high-quality ingredients and the traditional handcrafted techniques passed down through generations.

2. **Sicilian Pizza (Sicily, Italy):** Sicilian pizza has a thick, focaccia-like crust and is often topped with tomatoes, mozzarella cheese, and various flavorful ingredients such as anchovies, olives, and capers. It has a unique flavor profile and is often enjoyed as street food.

3. **Margherita Pizza (Italy):** This classic pizza combines tomatoes, mozzarella cheese, and basil, representing the colors of the Italian flag. Margherita pizza is often considered the national dish of Italy and is celebrated on the country's national holiday, 2 June.

4. **California-Style Pizza (United States):** Known for its bold flavors and creative toppings, California-style pizza emphasizes the use of high-quality ingredients and the combination of different flavors and textures. Pizza places in this region often experiment with different types of dough, sauces, and toppings, resulting in varied and flavorful pizzas.

5. **New York-Style Pizza (United States):** New York-style pizza has a thin, foldable crust and is often topped with tomatoes, mozzarella cheese, and various flavorful ingredients such as pepperoni, mushrooms, and onions. It is known for its dense and chewy texture, and the use of fresh, high-quality ingredients.

These pizzas represent some of the most celebrated and well-known pizza styles worldwide, and their popularity is often influenced by cultural and regional traditions. It's important to note that different people may have different opinions about the best pizza, and the list of the best pizzas in the world can be subjective and varied based on personal preferences and cultural influences.

I tried this with systemPrompt:

const session = await ai.createTextSession({
    systemPrompt: systemContent + contextContent,
    temperature: 0.0,
    topK: 10
})

console.log(await session.prompt(userContent))

The answer is similar:

(The answer is identical to the one above, listing the same five pizza styles.)

I tried this with all the text as an argument of the prompt method:

const session = await ai.createTextSession({
    temperature: 0.0,
    topK: 10
})

console.log(await session.prompt(systemContent + contextContent + userContent))

Then, the answer is what I expected:

 The best pizza of the world is the pineapple pizza because it has a sweet and savory fusion, textural delight, balanced flavors, and tropical twist.

Add User Prompt control interface

Currently the browser uses a fixed prompt structure. Is it possible to add a user-customized structure, or to use the prompt from the API parameter directly, instead of forcing this prompt structure?
const char kPromptFormat[] = "User: %s\nModel: ";

window.ai.contextCache

Use Case

The window.ai.contextCache API allows web applications to efficiently manage and reuse context information for AI models. This is particularly useful for applications that involve ongoing conversations or require maintaining state across multiple AI interactions, such as chatbots, virtual assistants, or context-aware content generation tools.

By caching context, applications can:

  1. Improve response relevance in multi-turn conversations
  2. Reduce latency by avoiding the need to resend full conversation history
  3. Optimize resource usage by managing context size

API Description

interface ContextCacheOptions {
  maxSize?: number; // Maximum number of tokens or characters to store
  ttl?: number; // Time-to-live in milliseconds
}

interface ContextEntry {
  id: string;
  content: string;
  timestamp: number;
}

interface WindowAI {
  contextCache: {
    add(id: string, content: string): Promise<void>;
    get(id: string): Promise<string | null>;
    update(id: string, content: string): Promise<void>;
    delete(id: string): Promise<void>;
    clear(): Promise<void>;
    setOptions(options: ContextCacheOptions): Promise<void>;
  };
}

interface Window {
  ai: WindowAI;
}

Methods

  • add(id: string, content: string): Adds a new context entry to the cache.
  • get(id: string): Retrieves a context entry by its ID.
  • update(id: string, content: string): Updates an existing context entry.
  • delete(id: string): Removes a context entry from the cache.
  • clear(): Removes all entries from the cache.
  • setOptions(options: ContextCacheOptions): Configures cache behavior.

Example Usage

async function manageConversationContext(conversationId, newMessage) {
  // Configure cache
  await window.ai.contextCache.setOptions({ maxSize: 1000, ttl: 3600000 });

  // Retrieve existing context
  let context = await window.ai.contextCache.get(conversationId);

  // Update context with new message
  context = (context ? context + "\n" : "") + newMessage;
  await window.ai.contextCache.update(conversationId, context);

  // Use updated context in AI interaction
  const response = await someAIFunction(context);

  return response;
}

This API provides a simple yet flexible way to manage context information for AI interactions in web applications.

window.ai.rag

The window.ai.rag API enables web applications to perform Retrieval-Augmented Generation (RAG) directly in the browser. RAG combines the power of large language models with the ability to retrieve and incorporate relevant information from a knowledge base.

This API is particularly useful for:

  1. Question-answering systems that need access to up-to-date or specialized information
  2. Content generation tools that require factual accuracy and domain-specific knowledge
  3. Chatbots or virtual assistants that need to provide informative responses based on a specific corpus of data

By implementing RAG in the browser, applications can:

  1. Provide more accurate and contextually relevant responses
  2. Reduce the load on server-side infrastructure
  3. Offer personalized experiences based on user-specific data

API:

interface RAGOptions {
  model: string; // Identifier for the base language model
  knowledgeBase: KnowledgeBase;
  retrievalOptions?: RetrievalOptions;
  generationOptions?: GenerationOptions;
}

interface KnowledgeBase {
  type: 'vector' | 'inverted-index' | 'hybrid';
  data: ArrayBuffer | string[]; // Depending on the type
}

interface RetrievalOptions {
  topK?: number; // Number of documents to retrieve
  similarityThreshold?: number;
}

interface GenerationOptions {
  maxTokens?: number;
  temperature?: number;
}

interface RAGResult {
  generatedText: string;
  retrievedDocuments: string[];
  confidence: number;
}

interface WindowAI {
  rag: {
    query(input: string, options: RAGOptions): Promise<RAGResult>;
    updateKnowledgeBase(newData: ArrayBuffer | string[]): Promise<void>;
  };
}

interface Window {
  ai: WindowAI;
}
const result = await window.ai.rag.query(question, ragOptions);

Exposing a model ID or similar

As discussed in the explainer's Goals section, it might be useful to allow web developers to know some identifier for the language model in use, separate from the browser version. This would allow them to allowlist or blocklist specific models to maintain a desired level of quality, or restrict certain use cases to a specific model.

This would probably not have significant privacy implications, since we already expose prompt() and sufficiently-detailed prompting should be able to distinguish between possible models.

However, I worry a bit about the interop issues. Adding such an API does make it much easier to write code that only works in one browser, or with one model.

Plausible interoperability?

Hey folks,

There have been conversations in our world about the difficulty in achieving plausible interoperability for this design. A layered version which relies on downloaded models that execute on WebNN with some solution for cross-origin sharing of models seems like it could be more plausibly interoperable.

I didn't see explorations of alternatives along these lines in the Explainer. Have they been considered?
