The Foundation Model Development Cheatsheet

Resources and recommendations for best practices in developing and releasing models.

Cheatsheet | Contribute Resources | Paper | Contact and Citation

Add to Cheatsheet

To contribute resources to the cheatsheet, please review the Criteria for Inclusion below, and the Add Resource Instructions.

Criteria for Inclusion:

The resources are selected based on a literature review for each phase of foundation model development. Inclusion is predicated on: the perceived helpfulness as a development tool, the extent and quality of the documentation, and the insights brought to the development process. Please ensure your candidate resource will meaningfully aid responsible development practices. While we do accept academic literature as a resource, this cheatsheet focuses on tools, such as data catalogs, search/analysis tools, evaluation repositories, and, selectively, literature that summarizes, surveys, or guides important development decisions.

We will review suggested contributions and (optionally) acknowledge contributors to this cheatsheet on the website and in future work.

Add Resource Instructions:

  • Option 1: Use this upload form to contribute a resource.

  • Option 2: Bulk upload resources by creating a pull request in this repository, extending app/resources/resources.jsonl.

In both cases, it is essential that the requested documentation on each resource is accurate and complete.
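For the bulk-upload route, each resource is a single JSON object on its own line of app/resources/resources.jsonl. A minimal sketch in Python (the field names below are hypothetical; the actual schema is defined by the existing entries in the repository):

```python
import json

# Hypothetical resource entry -- the real required fields are defined
# by the existing entries in app/resources/resources.jsonl.
entry = {
    "name": "Example Data Catalog",
    "url": "https://example.org/catalog",
    "description": "Searchable catalog of pretraining corpora.",
}

# In JSON Lines format, each resource occupies exactly one line.
line = json.dumps(entry)
print(line)
```

Before opening a pull request, check that every line of the file parses as standalone JSON.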

Contact and Citation

Contact [email protected] for questions about this resource.

Citation coming soon.

fm-cheatsheet's People

Contributors

clefourrier, hiyouga, mewil, neural-loop, shayne-longpre, soldni


fm-cheatsheet's Issues

Intro Text for Risks and Harms Taxonomy Page

Replace

Taxonomies provide a way of categorising, defining and understanding risks and hazards created through the use and deployment of AI systems. Some taxonomies focus primarily on the types of interactions and uses that create a risk of harm whereas others focus on the negative effects that they lead to.

With

Taxonomies provide a way of categorising, defining and understanding risks and hazards created through the use and deployment of AI systems. The following taxonomies focus on the types of interactions and uses that create a risk of harm as well as the negative effects that they lead to.

Intro Text for Data Documentation Page

Replace

When releasing new data resources with a model, it is important to thoroughly document the data. Documentation allows users to understand its intended uses, legal restrictions, attribution, relevant contents, privacy concerns, and other limitations. Many data documentation standards have been proposed, but their adoption has been uneven. Crowdsourced documentation may contain errors and omissions

With

Data documentation allows users to understand a dataset's intended uses, legal restrictions, attribution, relevant contents, privacy concerns, and other limitations. Many data documentation standards have been proposed, but their adoption has been uneven. It is important to recognize that crowdsourced documentation may contain errors and omissions.

Intro Text for Pretraining Page

Replace:

Pretraining data consists of thousands, or even millions, of individual documents, often web scraped. Model knowledge and behavior will likely reflect a compression of this information and its communication qualities. It's important to carefully select the data composition. This decision should reflect choices in language coverage, mix of sources, and preprocessing decisions.

with:

Pretraining data provides the fundamental ingredient of foundation models, including their capabilities and flaws. Corpora consist of millions of pieces of content, from documents, images, and videos to speech recordings, often scraped from the web. The data composition should be selected carefully, reflecting choices in language coverage, mixture of sources, and preprocessing decisions.

Intro Text for Education Resources Page

  1. Change subtitle to: "Additional Educational Resources"

Replace

Training models at any scale can be quite daunting to newer practitioners. Here, we include several educational resources that may be useful in learning about the considerations required for successfully and effectively training or fine-tuning foundation models.

With

Training models at any scale can be quite daunting to newer practitioners. The following educational resources may be useful in learning about the considerations required for successfully and effectively training or fine-tuning foundation models.

Intro Text for Finetuning Page

Replace:

Finetuning data is used to hone specific capabilities, orient the model to a certain task format, improve its responses to general instructions, mitigate harmful or unhelpful response patterns, or generally align its responses to human preferences. Developers use a variety of data annotations and loss objectives for finetuning, including traditional supervised finetuning, DPO, or reinforcement learning with human feedback. Explore various data catalogs, their attached documentation, and specialized finetuning data sources.

with:

Finetuning data is used to hone a model's specific capabilities, orient it to a certain task, improve its responses to instructions, mitigate harmful or unhelpful behaviors, and/or align it to human preferences. Given the thousands of specialized data sources for finetuning, we recommend using data catalogs that provide well documented datasets.

Intro Text for Pretraining Repositories Page

Replace

Practitioners should consider using already-optimized codebases, especially in the pre-training phase, to ensure effective use of computational resources, capital, power, and effort. Existing open-source codebases targeted at foundation model pretraining can make pretraining significantly more accessible to new practitioners and help accumulate techniques for efficiency in model training.

With

Practitioners should consider using already-optimized codebases, especially in the pre-training phase, to ensure effective use of computational resources, capital, power, and effort. Existing open-source codebases targeted at foundation model pretraining can be significantly more accessible to new practitioners and help contribute to efficient training strategies.

Intro Text for Model Documentation Page

Replace

It is important to document models that are used and released. Even models and code released openly are important to document thoroughly, in order to specify how to use the model, recommended and non-recommended use cases, potential harms, state or justify decisions made during training, and more.

With

It is important to document models that are used and released. Even models and code released openly are important to document thoroughly, in order to specify how to use the model, recommended and non-recommended use cases, potential harms, state or justify decisions made during training, and more. The following tools can help with documentation.

Intro Text for Risk and Harm Evaluation Page

  1. Change the link text on the homepage from "Risks and Harms" to "Risks and Harms Evaluation"

Replace

Evaluations of risk serve multiple purposes: to identify if there are issues which need mitigation, to track the success of any such mitigations, to document for other users of the model what risks are still present, and to help make decisions related to model access and release.

With

The following tools for evaluating risk serve multiple purposes: to identify if there are issues which need mitigation, to track the success of any such mitigations, to document for other users of the model what risks are still present, and to help make decisions related to model access and release.

Intro Text for Data Decontamination Page

Replace

Data decontamination is the process of removing evaluation data from the training dataset. This important step in data preprocessing ensures the integrity of model evaluation, ensuring that metrics are reliable and not misleading. The following resources aid in proactively protecting test data with canaries, decontaminating data before training, and identifying or proving what data a model was trained on.

With

Data decontamination is the process of removing evaluation data from the training set. This step ensures the integrity of model evaluation. The following resources aid in proactively protecting test data with canaries, decontaminating data before training, and identifying or proving what data a model was trained on.
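The overlap check at the heart of many decontamination pipelines can be sketched as an n-gram intersection test (a simplified illustration; the n-gram size and threshold below are arbitrary choices, and production tools also normalize text and handle near-duplicates):

```python
def ngrams(text, n=8):
    # Collect all whitespace-tokenized n-grams of a document.
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, eval_ngrams, n=8, threshold=1):
    # Flag a training document if it shares at least `threshold`
    # n-grams with the evaluation set.
    return len(ngrams(train_doc, n) & eval_ngrams) >= threshold

eval_ngrams = ngrams("the quick brown fox jumps over the lazy dog today", n=8)
print(is_contaminated(
    "report: the quick brown fox jumps over the lazy dog today again",
    eval_ngrams))  # → True
```

Documents flagged this way are removed (or their matching spans masked) before training begins.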

Intro Text for Data Exploration Page

Replace subheading

Data Search, Analysis, & Exploration

With

Data Exploration

Replace text

Exploring training datasets with search and analysis tools helps practitioners develop a nuanced intuition for what's in the data, and therefore their model. Many aspects of data are difficult to summarize or document without hands-on exploration. Text data, for example, can have a distribution of lengths, topics, tones, formats, licenses, and even diction.

With

Exploring training datasets with search and analysis tools helps practitioners develop a nuanced intuition for what is in the data, and therefore their model. Data can be difficult to understand, summarize or document without hands-on exploration.

Intro Text for Usage Monitoring Page

Replace

Some open foundation model developers attempt to monitor the usage of their models, whether by watermarking model outputs or gating access to the model.

With

Monitoring foundation model usage is an evolving area of research. The following techniques, such as watermarking model outputs or gating access to the model, are some of the ways to do so.

Intro Text for Finetuning Repositories

Replace

Fine-tuning, or other types of adaptation performed on foundation models after pretraining, are an equally important and complex step in model development. Fine-tuned models are more frequently deployed than base models. Here, we also link to some useful and widely-used resources for adapting foundation models or otherwise fine-tuning them.

With

Finetuning or adaptation of foundation models is a complex step in model development. These models are more frequently deployed than base models. Here, we link to some useful and widely-used resources for finetuning.

Intro Text for Eval Capabilities Page

Replace

Many modern foundation models are released with general conversational abilities, such that their use cases are poorly specified and open-ended. This poses significant challenges to evaluation benchmarks which are unable to critically evaluate so many tasks, applications, and risks systematically or fairly. As a result, it is important to carefully scope the original intentions for the model, and the evaluations to those intentions.

With

Many modern foundation models are released with general abilities, such that their use cases are poorly specified and open-ended, posing significant challenges to evaluation benchmarks which are unable to critically evaluate so many tasks, applications, and risks systematically or fairly. It is important to carefully scope the original intentions for the model, and the evaluations to those intentions.

Intro Text for License Selection

Replace

Foundation models, like software, are accompanied by licenses that determine how they may be distributed, used, and repurposed. There are a variety of licenses to choose between for open foundation model developers, presenting potential challenges for new developers.

With

Foundation models, like software, are accompanied by licenses that determine how they may be distributed, used, and repurposed. The following resources can help one determine which type of license to use.

Intro Text for Data Auditing Page

Replace

Auditing datasets is an essential component of dataset design. Spend a substantial amount of time reading through your dataset, ideally at many stages of the dataset design process. Many datasets have problems specifically because the authors did not do sufficient auditing before releasing them. Use systematic studies of the process in addition to data search, analysis, & exploration tools to track the dataset's evolution.

With

Auditing datasets is essential: spend a substantial amount of time inspecting your dataset at multiple stages of the dataset design process. Many datasets have problems specifically because the authors did not do sufficient auditing before releasing them. Use systematic studies of the process in addition to data search, analysis, & exploration tools to track the dataset's evolution.

Intro Text for Data Duplication Page

Replace

Data deduplication is an important preprocessing step where duplicated documents, or chunks within a document, are removed from the dataset. Removing duplicates can reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information. Additionally, removing duplicated data improves training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

With

Removing data duplicates can 1) reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information, and 2) improve training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.
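A minimal sketch of the exact-deduplication step described above, assuming whitespace- and case-normalized hashing (real pipelines typically add fuzzy methods such as MinHash to catch near-duplicates):

```python
import hashlib

def deduplicate(docs):
    # Exact deduplication: keep the first occurrence of each document,
    # keyed by a hash of its lowercased, whitespace-normalized text.
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello world", "hello   world", "Goodbye"]
print(deduplicate(docs))  # → ['Hello world', 'Goodbye']
```

Hashing keeps memory proportional to the number of unique documents rather than the corpus size, which matters at pretraining scale.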

Intro Text for Reproducibility Resources Page

Replace

Model releases often go accompanied with claims on evaluation performance, but those results are not always reproducible, or can be misleading. If code is not released, is not comprehensive, is difficult to run, or misses key details, this will cost the scientific community time and effort to replicate and verify the claims. Replication time will also slow progress, and discourage developers from adopting that resource over others.

With

Model releases accompanied by performance claims that are not reproducible, or by code that is unavailable, incomplete, or difficult to run, cost the scientific community time and effort. The following resources help others replicate and verify such claims.

Intro Text for Efficiency and Resource Allocation Page

Replace

Knowledge of training best practices and efficiency techniques can reduce costs to train a desired model significantly. Here, we include a select few readings and resources on effectively using a given resource budget for model training, such as several canonical papers on fitting scaling laws.

With

Knowledge of training best practices can reduce the cost of training a desired model significantly. Here, we link to readings and resources on effectively using a given resource budget for model training, including canonical papers on fitting scaling laws.
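As a rough illustration of the budgeting these scaling-law papers enable, the widely used approximation C ≈ 6ND (training FLOPs ≈ 6 × parameters × tokens), combined with the Chinchilla-style rule of thumb of roughly 20 tokens per parameter, gives a quick compute estimate (illustrative numbers only):

```python
def chinchilla_tokens(n_params):
    # Chinchilla-style rule of thumb: ~20 training tokens per parameter.
    return 20 * n_params

def train_flops(n_params, n_tokens):
    # Common approximation: C ≈ 6 * N * D total training FLOPs.
    return 6 * n_params * n_tokens

n = 7_000_000_000                  # a 7B-parameter model
d = chinchilla_tokens(n)           # 1.4e11 tokens
print(f"{train_flops(n, d):.2e}")  # → 5.88e+21 FLOPs
```

Estimates like this help decide, before training starts, whether a model size fits the available compute budget.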

How can we cite your work?

Hello,

Many thanks for releasing this, really cool work 😃 I am working on a paper and was wondering how we can cite your work.

Thanks,
George

Intro Text for Environmental Impact Page

Replace

Current tools focus on measuring the energy consumed during training or inference and multiplying it by the carbon intensity of the energy source used. For efficient use of resources, several decisions made during or prior to model training can have significant impacts on the upstream and downstream environmental impact of a given model.

With

Foundation model development is often resource intensive. The following tools help one to measure energy consumption and estimate the carbon intensity of the energy source used. Decisions made during or prior to model training can have a significant effect on the upstream and downstream environmental impact of a given model.
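The basic estimate these tools implement, energy consumed multiplied by the carbon intensity of the energy source, can be sketched as follows (all constants below are illustrative placeholders, not measurements):

```python
def training_emissions_kgco2(gpu_count, avg_power_watts, hours,
                             pue=1.1, grid_kgco2_per_kwh=0.4):
    # Energy drawn by the hardware, scaled by datacenter overhead (PUE),
    # times the grid's carbon intensity. Illustrative defaults only.
    energy_kwh = gpu_count * avg_power_watts * hours / 1000 * pue
    return energy_kwh * grid_kgco2_per_kwh

# A toy run: 8 GPUs at 300 W average for 24 hours.
print(round(training_emissions_kgco2(8, 300, 24), 1))  # → 25.3
```

Real measurement tools additionally account for embodied hardware emissions and time-varying grid intensity, which this sketch ignores.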

Intro Text for Data Cleaning Page

Replace

Data cleaning and filtering are crucial steps in curating a dataset. They remove unwanted data, improving training efficiency and ensuring desirable properties like high information content, desired languages, low toxicity, and minimal personally identifiable information. Consider trade-offs when using filters and understand the importance of data mixing in preparation.

With

Data quality is crucial. Filtering can remove unwanted data, improving training efficiency and ensuring desirable properties like high information content, desired languages, low toxicity, and minimal personally identifiable information. Consider trade-offs when using filters and understand the importance of data mixtures.
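A toy example of the rule-based filtering described above, assuming two hypothetical heuristics (a length floor and an email-address pattern as a PII proxy); real pipelines layer many more filters such as language identification, quality classifiers, and toxicity models:

```python
import re

# Crude PII proxy: any email-like token. Real PII scrubbers are far broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def keep(doc, min_words=5):
    if len(doc.split()) < min_words:  # drop very short, low-information docs
        return False
    if EMAIL.search(doc):             # drop docs with obvious PII
        return False
    return True

docs = ["ok",
        "a normal paragraph with enough words in it",
        "contact me at someone@example.com for details please"]
print([d for d in docs if keep(d)])  # → ['a normal paragraph with enough words in it']
```

Each filter trades recall for precision: the length floor also discards legitimate short texts, which is exactly the kind of trade-off to weigh per use case.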
