opening-up-chatgpt / opening-up-chatgpt.github.io

Tracking instruction-tuned LLM openness. Paper: Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In Proceedings of the 5th International Conference on Conversational User Interfaces. doi:10.1145/3571884.3604316.

Home Page: https://opening-up-chatgpt.github.io/

License: Apache License 2.0

Python 100.00%
chatgpt llm open-source rlhf transparency chatgpt-free

opening-up-chatgpt.github.io's Introduction

Opening up ChatGPT — tracking openness of instruction-tuned LLMs — openness leaderboard

Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In Proceedings of the 5th International Conference on Conversational User Interfaces. Eindhoven. doi:10.1145/3571884.3604316. (PDF)

Large language models that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of OpenAI's ChatGPT, a proprietary large language model for text generation fine-tuned through reinforcement learning from human feedback (LLM+RLHF). We review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. We evaluate projects in terms of openness of code, training data, model weights, reinforcement learning data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important RLHF components (a key site where human labour is involved), and careful scientific documentation is exceedingly rare. Degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.


Overview

We classify projects for their degrees of openness across a predefined set of criteria in the areas of Availability, Documentation and Access. The criteria are described in detail here.

Availability
  • Open code
  • LLM data
  • LLM weights
  • RL data
  • RL weights
  • License

Documentation
  • Code
  • Architecture
  • Preprint
  • Paper
  • Model card
  • Data sheet

Access
  • Package
  • API

If you find any of this useful, please cite our paper:

@inproceedings{liesenfeld_opening_2023,
	address = {Eindhoven},
	title = {Opening up {ChatGPT}: tracking openness, transparency, and accountability in instruction-tuned text generators},
	url = {https://opening-up-chatgpt.github.io},
	doi = {10.1145/3571884.3604316},
	booktitle = {Proceedings of the 5th {International} {Conference} on {Conversational} {User} {Interfaces}},
	publisher = {Association for Computing Machinery},
	author = {Liesenfeld, Andreas and Lopez, Alianda and Dingemanse, Mark},
	year = {2023},
	pages = {1--6},
}

How to contribute

If you know of a new instruction-tuned LLM+RLHF model we should be including, you can add an issue.

How to contribute to the live table:

  1. Fork the repo and edit an existing yaml file or create a new one based on the sample yaml file in /projects
  2. File a pull request to have your changes reviewed and, hopefully, merged into main.

The live table is updated whenever there is a change to the files in the /projects/ folder.
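
Concretely, the trigger can be a GitHub Actions workflow that watches the /projects/ folder. Here is a minimal sketch of what such a workflow could look like; the file name, action version, and run step are illustrative assumptions, not this repo's actual workflow:

name: Render live table
on:
  push:
    paths:
      - "projects/**"
jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # consolidate_csv.py is the merge script referred to elsewhere in this repo
      - run: python consolidate_csv.py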

Related resources

We try to be fairly systematic in our coverage of LLM+RLHF models, documenting degrees of openness for >10 features. There are many other resources that provide more free-form listings of related projects or that offer ways to interact with (open) LLMs.

Here are some background readings on why openness matters, why closed models make bad baselines, and why some of us call for more counterfoil research in times of hype:

  • The gradient of generative AI release — FAccT '23 paper by Irene Solaiman on degrees of openness in generative AI
  • Closed AI models make bad baselines, by Anna Rogers. Proposes a simple principle: "That which is not open and reasonably reproducible cannot be considered a requisite baseline."
  • Why ChatGPT is bad for open psycholinguistics — by Cassandra Jacobs. Quote: "The downsides of ChatGPT are specific to it—not intrinsic to language modeling as a whole. Using ChatGPT [in] one’s work undermines open science, reproducibility & lacks the flexibility of previous systems that could be manipulated & changed to suit one’s scientific needs."
  • Stop feeding the hype and start resisting, by Iris van Rooij. Quote: "It’s almost as if academics are eager to do the PR work for OpenAI (the company that created ChatGPT; as well as its predecessor GPT-3 and its anticipated successor GPT-4). Why?"
  • AI is a lot of work — by Josh Dzieza for The Verge. Quote: "ChatGPT seems so human because it was trained by an AI that was mimicking humans who were rating an AI that was mimicking humans who were pretending to be a better version of an AI that was trained on human writing."

Contribute

Contributions welcome! Read the contribution guidelines first.


opening-up-chatgpt.github.io's People

Contributors: liesenf, mdingemanse, timjzee

opening-up-chatgpt.github.io's Issues

Set up github actions workflow to render and deploy to github pages

Goal is ultimately to collate all info from project-specific CSV files and use that to render github pages, to be deployed automatically when we commit to a certain branch, as outlined here.

Two types of pages we want to render:

  1. A big table comparable to Figure 1 in the paper that lists all included projects
  2. A more texty list of all projects, names, notes, and key observations per project.

change the title?

This is a great resource. I had an incredibly difficult time finding it again in my search history because of the title "Opening up ChatGPT": I was looking for something that alluded to a comparison of LLMs on their open-source characteristics. I think you might improve discoverability by being more explicit there.

Thanks for the resource!

Add OpenChat

https://github.com/imoneoi/openchat

OpenChat is a series of open-source language models based on supervised fine-tuning (SFT). We release two model versions (v1 and v2). Specifically, v1 uses only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations, while v2 adopts ~80K cleaned ShareGPT conversations with a conditioning strategy and weighted loss. Despite our methods being simple, OpenChat has demonstrated remarkable performance. Our final vision is to develop a high-performance, open-source and commercially available large language model, and we are still moving forward.

How to add a new model:

  1. Make a copy of AA_sample.yaml
  2. Give it the name of the model (e.g. gpt4all.yaml)
  3. Fill out as many of the features as you can find information for. Be sure to provide evidence and arguments in the form of links and notes.
  4. File a pull request. The repo maintainers will have a look and hopefully merge your contributions into main.
  5. Voilà, it will appear in the live table.

Convert our initial dataset to per-project .csv files

Something like the following:

|- projects
   |- chatgpt.csv
   |- falcon-40B-instruct.csv

in which each csv is a simple two-column file, easy to edit and diff, as follows:

var_name,value
project_name,"name"
project_link,"URL"
maker, "TII"
maker_link,"URL"
opencode_classification,"partial"
opencode_link,"URL"
opencode_notes,"string"
license_classification,"string"
license_link,"URL"
license_notes,"string"
...

All _classification fields take one of the values "open", "partial", "not open".

With data organized this way we can move towards a system where github actions take all csvs and render them together in a github pages table, awesomelist template, etc.
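
A minimal sketch of that collation step (a hypothetical illustration; the folder path and the two-column layout follow the example above):

import csv
from pathlib import Path

def read_project(path: Path) -> dict:
    # Two-column var_name,value files; quoted values may contain commas.
    with path.open(newline="") as f:
        return {row[0].strip(): row[1].strip()
                for row in csv.reader(f) if len(row) >= 2}

projects = [read_project(p) for p in Path("projects").glob("*.csv")]
# The union of all field names gives the merged table stable columns.
columns = sorted({key for project in projects for key in project})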

Populate links and notes fields

First import of data was just to reproduce table from paper; as we move towards release it would be good to have all data points documented and linked. I've made a start with xmtf, Open Assistant, OpenChatKit and a bunch more — could you take a look at this @liesenf?

Add gpt4all / gpt4all-J

https://github.com/nomic-ai/gpt4all

GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer grade CPUs. (...) The goal is simple - be the best instruction tuned assistant-style language model that any person or enterprise can freely use, distribute and build on.

Preprint offers a fair bit of detail:
https://static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf

How to add a new model:

  1. Make a copy of AA_sample.yaml
  2. Give it the name of the model (e.g. gpt4all.yaml)
  3. Fill out as many of the features as you can find information for. Be sure to provide evidence and arguments in the form of links and notes.
  4. File a pull request. The repo maintainers will have a look and hopefully merge your contributions into main.
  5. Voilà, it will appear in the live table.

Expand descriptions of 13 x 3 dimensions

Goal: all 13 dimensions should have clearly defined "open", "partial", "closed" states to aid reproducibility in coding and to help resolve judgment calls or proposals to change codes.

There should also be some rules about inherited and derived elements. E.g., inherited elements can never be more open than their source, but they can be more closed (as when a model is described as being based "80% on dataset X" without specifying how sampling was done).
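
A sketch of how such a rule could be enforced mechanically (a hypothetical helper, not part of the repo's code):

# Openness is ordered: closed < partial < open.
OPENNESS = {"closed": 0, "partial": 1, "open": 2}

def inherited_class(source_class: str, claimed_class: str) -> str:
    # An inherited element is capped at the openness of its source.
    if OPENNESS[claimed_class] > OPENNESS[source_class]:
        return source_class
    return claimed_class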

Add ChatGLM-6B and ChatGLM2-6B

  1. https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md

ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese QA and dialogue. The model is trained for about 1T tokens of Chinese and English corpus, supplemented by supervised fine-tuning, feedback bootstrap, and reinforcement learning with human feedback. With only about 6.2 billion parameters, the model is able to generate answers that are in line with human preference.

  2. https://github.com/THUDM/ChatGLM2-6B

ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B. It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing a number of new features.

Explore ways to transclude yaml files

The permalink generation method shown below looks very nice (good find @liesenf). However, it is tied to a particular commit, so it won't update (this is where the hash comes from).

name: BELLE
link: https://github.com/LianjiaTech/BELLE
notes:
llmbase: LLaMA & BLOOMZ
rlbase: alpaca & shareGPT (synthetic)
license: Apache License 2.0
org:
    name: KE Technologies
    link: http://www.ke.com
    notes:
# availability:
opencode:
    class: open
    link:
    notes:
llmdata:
    class: open
    link:
    notes:
llmweights:
    class: partial
    link:
    notes: LLaMA based but copyright status unclear
rldata:
    class: partial
    link:
    notes:
rlweights:
    class: closed
    link:
    notes:
license:
    class: closed
    link:
    notes: LLaMA licence agreement
# documentation
code:
    class: closed
    link:
    notes:
architecture:
    class: open
    link:
    notes:
preprint:
    class: open
    link: https://arxiv.org/abs/2303.14742
    notes:
paper:
    class: closed
    link:
    notes:
modelcard:
    class: closed
    link:
    notes:

Any other methods available?
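
One possible alternative (an assumption, not a vetted method): fetch the file from its raw.githubusercontent.com URL on the main branch, which tracks the latest commit instead of a fixed hash. The file path below is hypothetical:

from urllib.request import urlopen

RAW_URL = ("https://raw.githubusercontent.com/opening-up-chatgpt/"
           "opening-up-chatgpt.github.io/main/projects/belle.yaml")

def fetch_latest(url: str = RAW_URL) -> str:
    # Always returns the newest version on main, unlike a commit-pinned permalink.
    with urlopen(url) as response:
        return response.read().decode("utf-8")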

Add Vicuna 33B v 1.3

New and bigger Vicuna version appears sufficiently different from the one we already have (Vicuna 13B) to include.

See here for the differences: basically, v1.3 is the first to introduce a 33B-parameter version based on LLaMA:
https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md

How to add a new model:

  1. Make a copy of AA_sample.yaml
  2. Give it the name of the model (e.g. gpt4all.yaml)
  3. Fill out as many of the features as you can find information for. Be sure to provide evidence and arguments in the form of links and notes.
  4. File a pull request. The repo maintainers will have a look and hopefully merge your contributions into main.
  5. Voilà, it will appear in the live table.

set up repo to prepare for csv-to-interactive table workflow

Goal

Create a live, automatically updated table similar to the one in our paper (Figure 1).

How to get there

The website branch has a projects folder with 3 CSVs (and a sample_project.csv template to create more). These are minimal CSVs that represent the basic data structure we need for each project:

project_name,"Stanford Alpaca"
project_link,"https://crfm.stanford.edu/2023/03/13/alpaca.html"
project_notes,""
opencode_class,"open"
opencode_link,""
opencode_notes,""
llmdata_class,"open"
llmdata_link,"https://github.com/tatsu-lab/stanford_alpaca#data-release"
llmdata_notes,""
llmweights_class,"partial"
llmweights_link,""
llmweights_notes,"LLaMA based, copyright status unclear"
  • project_ fields contain metadata
  • a further set of data fields (opencode, llmdata, etc.) always comes in threes:
    • _class with three possible values: "open", "partial", "closed". These map onto what is currently "✓,~,✗" and green/orange/red in the Table in the paper.
    • _link with a URL that documents the source for the class judgement
    • _notes that provides explanation and reasoning for the class judgement
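
A small validation sketch for this convention (a hypothetical helper; field names as above):

VALID_CLASSES = {"open", "partial", "closed"}

def check_classes(fields: dict) -> list:
    # Flag any _class field whose value is not one of the three allowed levels.
    return [key for key, value in fields.items()
            if key.endswith("_class") and value not in VALID_CLASSES]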

What we need

Roughly, the following:

  • A github action, triggered when we close a pull request, that pulls together the latest version of all csvs in the projects folder and merges them into one
  • A conversion of that master csv file into a live table, where
    • project_name is a row
    • data fields (opencode, llmdata, llmweights, etc.) are columns
    • cells get a "✓,~,✗" and a colour depending on the _class data field
    • cells contain links that correspond to the _link data field
    • the links in the cells contain a title attribute that corresponds to the _notes data field
  • A github action, triggered when the table is updated, that writes it to a github page.
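
For the cell conversion described above, a minimal rendering sketch (the function name and markup choices are assumptions):

SYMBOLS = {"open": "✓", "partial": "~", "closed": "✗"}

def render_cell(cls: str, link: str, notes: str) -> str:
    # Colour comes from a CSS class named after the _class value;
    # notes go into the title attribute so they show on hover.
    symbol = SYMBOLS.get(cls, "?")
    return f'<td class="{cls}"><a href="{link}" title="{notes}">{symbol}</a></td>'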

nice to have: collapse model families

Many models come in variants that differ only in the number of parameters (7B/30B etc) and are otherwise the same on all criteria.

So far we have simply picked one per family, which is fine. But it can be useful to have variants beside one another. In that case it would be nice to also be able to collapse a family so that the table doesn't become cluttered.

Would require adding a family variable and checking for it. Would also require some html and CSS adjustments.
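
A grouping sketch, assuming each project record gains a family field shared by its variants (a hypothetical addition):

from collections import defaultdict

def group_by_family(projects: list) -> dict:
    # Projects without a family form singleton groups under their own name.
    families = defaultdict(list)
    for project in projects:
        families[project.get("family") or project["project_name"]].append(project)
    return dict(families)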

Add Chimera-Chat-7B

https://github.com/FreedomIntelligence/LLMZoo#chimera-llm-mainly-for-latin-and-cyrillic-languages

How to add a new model:

  1. Make a copy of AA_sample.yaml
  2. Give it the name of the model (e.g. chimera-chat-7B.yaml)
  3. Fill out as many of the features as you can find information for. Be sure to provide evidence and arguments in the form of links and notes.
  4. File a pull request. The repo maintainers will have a look and hopefully merge your contributions into main.
  5. Voilà, it will appear in the live table.

Table: Order projects by degree of openness

Not clear how projects are ordered right now, but I think ordering by degree of openness makes most sense.

There are many ways to sort but here's a reasonable start:

  1. Count all _class fields, assigning 1 for open, 0.5 for partial, and 0 for closed. The sum is the openness score.
  2. Order by openness score, with alphabetical ordering for ties

I foresee that we may want to tinker with the weighting here, so it would be nice to have this as a bit of modular code that is easy to adjust.

I think this is best done in consolidate_csv.py when the table is created.
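
A sketch of that scoring rule, kept modular so the weights are easy to tinker with (the function is an illustration, not existing code in consolidate_csv.py):

WEIGHTS = {"open": 1.0, "partial": 0.5, "closed": 0.0}

def openness_score(project: dict) -> float:
    # Sum the weight of every *_class field in a project record.
    return sum(WEIGHTS.get(value, 0.0)
               for key, value in project.items() if key.endswith("_class"))

# Descending score, alphabetical tie-break:
# projects.sort(key=lambda p: (-openness_score(p), p["project_name"].lower()))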

Nice to have: direct link to metadata source file in table

We use the title attribute to show notes on hover, but this makes sense only on desktops. To make the notes more accessible, each row should probably also include a direct link to the source CSV.

A good place for this link would be the rightmost cell of the secondary row (in grey like the other text in that row).

An additional benefit is that it provides a quick link for when folks want to contribute edits.

Add training time & hardware reported (towards energy consumption measures)

Djoerd Hiemstra notes (on Mastodon):

If you extend Opening up ChatGPT with a "training time reported" and a "hardware reported" column (and the code is open), then we can estimate the amount of energy that training the model (once) cost, plus the amount of CO2 that was emitted in the process.

This would be an interesting experiment, perhaps best done in a fork or a separate branch to see (i) how much data there is and (ii) how much two additional columns mess up the website layout.

Would require adding two more groups of key/value pairs in the yaml template (and in all existing YAML files) and adding those same fields (and default weights) to consolidate_csv.py.

training_reported:
    class: closed
    link:
    notes:

hardware_reported:
    class: closed
    link:
    notes:
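
A back-of-the-envelope sketch of the estimate these two fields would enable (the formula and the carbon-intensity default are illustrative assumptions, not a method used by this project):

def training_energy_kwh(gpu_count: int, tdp_watts: float, hours: float) -> float:
    # Rough upper bound for one training run, assuming GPUs run at rated TDP.
    return gpu_count * tdp_watts * hours / 1000.0

def training_co2_kg(energy_kwh: float, kg_co2_per_kwh: float = 0.4) -> float:
    # ~0.4 kg CO2/kWh is a rough global grid average; it varies widely by region.
    return energy_kwh * kg_co2_per_kwh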

csv > md

Make the data as human-readable as possible, so change the format of the individual project files to .md.

Then build ways to collate (subsets of) these fields in the master awesomelist on the main page?

Add Airoboros

https://huggingface.co/jondurbin/airoboros-l2-70b-gpt4-1.4.1

Basically Llama2 with GPT4-generated instruction tuning.

The licensing situation is particularly interesting with this one:

  • The ToS for OpenAI API usage has a clause preventing the output from being used to train a model that competes with OpenAI
  • what does compete actually mean here?
  • these small open source models will not produce output anywhere near the quality of gpt-4, or even gpt-3.5, so I can't imagine this could credibly be considered competing in the first place
  • if someone else uses the dataset to do the same, they wouldn't necessarily be violating the ToS because they didn't call the API, so I don't know how that works
  • the training data used in essentially all large language models includes a significant amount of copyrighted or otherwise unallowably licensed material in the first place
  • other work using the self-instruct method, e.g. the original here: https://github.com/yizhongw/self-instruct released the data and model as apache-2
  • I am purposely leaving this license ambiguous (other than the fact you must comply with the Meta original license) because I am not a lawyer and refuse to attempt to interpret all of the terms accordingly.

Consider moving from csv to yaml

CSV is tricky to edit and we run into issues with separators (especially commas) in some fields (e.g. notes); YAML is more human-readable.
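
A one-off migration sketch (it assumes the two-column var_name,value CSV layout and the group_subfield naming shown elsewhere in this document; PyYAML required):

import csv
from pathlib import Path
import yaml  # PyYAML

def csv_to_yaml(src: Path, dst: Path) -> None:
    with src.open(newline="") as f:
        flat = {row[0].strip(): row[1].strip()
                for row in csv.reader(f) if len(row) >= 2}
    nested = {}
    for key, value in flat.items():
        group, _, subfield = key.partition("_")
        if subfield:  # e.g. opencode_class -> opencode: {class: ...}
            nested.setdefault(group, {})[subfield] = value
        else:
            nested[key] = value
    dst.write_text(yaml.safe_dump(nested, sort_keys=False, allow_unicode=True))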

Add ColossalChat

https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#what-is-colossalchat-and-coati-

ColossalChat is the project to implement LLM with RLHF, powered by the Colossal-AI project.

How to add a new model:

  1. Make a copy of AA_sample.yaml
  2. Give it the name of the model (e.g. gpt4all.yaml)
  3. Fill out as many of the features as you can find information for. Be sure to provide evidence and arguments in the form of links and notes.
  4. File a pull request. The repo maintainers will have a look and hopefully merge your contributions into main.
  5. Voilà, it will appear in the live table.

restructure list to allow automatic parsing

Goal is to be able to generate a colour-coded overview much like the one in the paper from data structured as in an awesomelist. For this to happen we need to give every instruction-following text generator its own section and use the same 13 tags in each section, with the value of each tag being something like Red/Yellow/Green or 0/1/2 to encode degrees of openness.

Add Xwin-LM

https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1

Xwin-LM aims to develop and open-source alignment technologies for large language models, including supervised fine-tuning (SFT), reward models (RM), rejection sampling, reinforcement learning from human feedback (RLHF), etc. Our first release, built upon the Llama2 base models, ranked TOP-1 on AlpacaEval. Notably, it's the first to surpass GPT-4 on this benchmark. The project will be continuously updated.

Add LLaMA 2 (llama-2-chat)

https://ai.meta.com/resources/models-and-libraries/llama/

Llama-2-chat uses reinforcement learning from human feedback to ensure safety and helpfulness.

Training Llama-2-chat: Llama 2 is pretrained using publicly available online data. An initial version of Llama-2-chat is then created through the use of supervised fine-tuning. Next, Llama-2-chat is iteratively refined using Reinforcement Learning from Human Feedback (RLHF), which includes rejection sampling and proximal policy optimization (PPO).

Preprint / 'paper' hosted on fbcdn (why not arxiv?): https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Make link to yaml source more prominent

On mobile devices, hover won't work so the site doesn't adequately show the sheer number of evidence-based data points we have assembled. It is also hard to hit the § link at the end of a secondary row.

Proposal: link every model name to its yaml file, and use the org name to link to the project (the org doesn't need its own link, they have enough SEO juice anyway). What do you say @liesenf @timjzee ?

Generate single index.html

Right now table.html is rendered and javascript is used to plug it into index.html on load. Advantage: we can edit the header and footer info separately from the generated table. Disadvantage: probably not being indexed optimally, so not great for SEO.

Goal: render a single index.html, while keeping the ability for us to edit the info above and below the table separately.

So: can we have a script combine template.html and table.html (in template.html, a <div id="included-table"></div> would determine where the table goes)?
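
A possible implementation of that combine step (file names follow the issue; the placeholder div is the one quoted above):

from pathlib import Path

def build_index() -> None:
    template = Path("template.html").read_text(encoding="utf-8")
    table = Path("table.html").read_text(encoding="utf-8")
    merged = template.replace('<div id="included-table"></div>',
                              f'<div id="included-table">{table}</div>')
    Path("index.html").write_text(merged, encoding="utf-8")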

Table: skip sample .csv

We want to have an empty sample .csv in /projects/ that people can use to add a new project, but that itself won't show up.

Two ways of going about this:

  1. Hardcode sample.csv as a csv that will never be included
  2. Check whether project_name = "" and exclude based on that

I prefer option 2 because it is cleaner and more flexible.
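
A sketch of option 2 (a hypothetical filter in the consolidation step):

def include_project(fields: dict) -> bool:
    # Exclude the sample template (and any stub) whose project_name is empty.
    return bool(fields.get("project_name", "").strip())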

Move repo & site over to organisation

Reorganise things to have everything under the opening-up-chatgpt org.
Two repos:

  • data: contains the readme.md for contributors and the /projects/ folder with .csvs
  • site: contains the scripts for merging csvs and writing to table.html

GitHub actions should then be reorganised such that a new commit to the main branch of data triggers an update to site, causing a new table to be written and the live table to update.
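
One way to wire that cross-repo trigger (a sketch; the secret name and repo paths are assumptions): have the data repo send a repository_dispatch event that a workflow in the site repo listens for.

name: Notify site repo
on:
  push:
    branches: [main]
jobs:
  dispatch:
    runs-on: ubuntu-latest
    steps:
      - run: |
          curl -X POST \
            -H "Authorization: token ${{ secrets.SITE_REPO_TOKEN }}" \
            -H "Accept: application/vnd.github+json" \
            https://api.github.com/repos/opening-up-chatgpt/site/dispatches \
            -d '{"event_type": "data-updated"}'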

Does Baize-Chat qualify as a full LLM+RLHF solution to be included?

https://huggingface.co/project-baize/baize-v2-13b

Baize is an open-source chat model fine-tuned with LoRA. This model is a 13B Baize-v2, trained with supervised fine-tuning (SFT) and self-distillation with feedback (SDF). This checkpoint has been merged with LLaMA so it's ready for use.

But also:

Using Baize checkpoints directly without the following format will not work.
[|Human|] and [|AI|] are required to mark the messages from the user and Baize. We recommend checking out our GitHub to find the best way to use Baize with our demo or Fastchat.

Add Humpback?

https://aibusiness.com/ml/meet-humpback-meta-s-new-ai-model-that-s-a-whale-of-an-upgrade-of-llama

https://doi.org/10.48550/arXiv.2308.06259

However, no claim in the paper that this is "open" or "open source" and indeed the author indicates legal trouble with making this open:

While I/we are proud of FAIR's open values/history, on this particular one we are still trying to navigate. Using Clueweb may have been a mistake in this regard as that is not free.. in general the product+legal landscape is very difficult these days :( .. just getting the paper release itself approved took many weeks...
https://twitter.com/jaseweston/status/1691428194170122240
