Comments (17)
It's very easy: just change the dataset_path in the config (configs/train/finetune_lora.yaml or configs/train/finetune.yaml). The data format is trivial: JSON Lines (jsonl), like this
{"prompt": "Russia Finishes Building Iran Nuclear Plant MOSCOW (Reuters) - Russia and Iran said Thursday they had finished construction of an atomic power plant in the Islamic Republic -- a project the United States fears Tehran could use to make nuclear arms. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "This is a piece of news regarding world politics and science and technology.", "source": "bigscience/p3"}
{"prompt": "Goosen brings a sparkle to the gloom With overnight rain causing a 2-hour delay to the start of play in the HSBC World Matchplay Championship at Wentworth, Retief Goosen did his bit to help to make up ground. \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
{"prompt": "Geiberger has share of clubhouse lead at Greensboro Brent Geiberger took advantage of ideal morning conditions to earn a share of the clubhouse lead with a 6-under-par 66 during the first \nIs this a piece of news regarding world politics, sports, business, or science and technology? ", "response": "sports", "source": "bigscience/p3"}
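To make the format concrete, here is a minimal sketch of writing and sanity-checking a file in this shape. The record contents and filename below are placeholders, not anything from the gpt4all repo:

```python
import json

# Hypothetical records -- replace with your own data.
records = [
    {"prompt": "What is the capital of France? ",
     "response": "Paris",
     "source": "my-custom-dataset"},
]

# Write one JSON object per line (jsonl).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Sanity check: every line must parse and carry the expected keys.
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        assert {"prompt", "response", "source"} <= row.keys()
```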
from gpt4all.
What if I do not have formatted data, just lots of pages of knowledge? Do I have to convert it to the above-mentioned format? If yes, how?
I'm also interested
Same, it would be awesome to work on a private knowledge base.
Updated the title here to make it more like a feature request. We plan on offering this soon.
Thanks buddy. Unfortunately, my dataset format is based on prompt and completion.
I'm doing this for Pashto poetry generation.
But I want to run the model on a single GPU. Is something like this possible?
But what do we set as the model name in the file, and the wandb entities? These are also requested but not explained, and if we understood them, I'd be up for a documentation PR 👀
There seem to be other questions to answer before one can train on custom data. Could we reopen this issue? It would be helpful if these terms were in the documentation so that others can train their own chat with their own data.
Put the filesystem path to the directory containing your HF-formatted model and tokenizer files in those fields. If you don't have a wandb account, which I assume is the case since otherwise it would be obvious, disable wandb. You may have to comment out a line or two if you disable wandb tracking. You may find it easier to just get a wandb account.
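For reference, a hedged sketch of how those fields might look in configs/train/finetune_lora.yaml. The exact key names can differ between versions of the repo, and every value below is a placeholder:

```yaml
# Placeholders -- point these at your own local paths.
model_name: "/path/to/hf-model-directory"      # directory with HF model weights
tokenizer_name: "/path/to/hf-model-directory"  # usually the same directory
dataset_path: "/path/to/train.jsonl"

# If you have no wandb account, disable tracking entirely:
wandb: false
wandb_entity: null   # your wandb username/team; only needed when wandb is true
```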
There is a blog post addressing your question:
https://levelup.gitconnected.com/training-your-own-llm-using-privategpt-f36f0c4f01ec
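Short of a full setup like the one in that post, one hedged sketch for turning raw pages of knowledge into the jsonl format used above is to chunk the text and split each chunk into a prompt half and a response half. The chunking, the prompt template, and the source label below are arbitrary choices, not anything gpt4all prescribes; for real instruction tuning you would rather build question/answer pairs from the material:

```python
import json

def pages_to_jsonl(pages, out_path, chunk_size=1000):
    """Chunk raw text pages and emit prompt/response jsonl records.

    Splitting each chunk down the middle is only an illustration of the
    record format -- it is not a recommended training recipe.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for page in pages:
            for start in range(0, len(page), chunk_size):
                chunk = page[start:start + chunk_size]
                mid = len(chunk) // 2
                record = {
                    "prompt": "Continue this documentation excerpt: " + chunk[:mid],
                    "response": chunk[mid:],
                    "source": "my-knowledge-base",  # placeholder label
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical usage with one long page of plain text.
pages_to_jsonl(["Lots of pages of internal knowledge, as plain text. " * 40],
               "knowledge.jsonl")
```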
Also interested. This could be covered in an additional docs file.
Thanks, I will also contribute...
I have tons of documentation in the form of GitLab pages, such as https://batchdocs.web.cern.ch/ . Has anyone worked on automatic preprocessing of markup-language pages to be fed as a training dataset to gpt4all?
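As a rough starting point, a hedged sketch that strips common Markdown syntax with stdlib regexes before building jsonl records. Real pages will need more careful handling of tables, admonitions, and site navigation, and the record fields here are just placeholders:

```python
import json
import re

def markdown_to_text(md: str) -> str:
    """Very rough Markdown stripper using only stdlib regexes."""
    text = re.sub(r"```.*?```", "", md, flags=re.DOTALL)        # drop fenced code blocks
    text = re.sub(r"`([^`]*)`", r"\1", text)                    # inline code -> bare text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)        # links -> link text
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"[*_]{1,3}([^*_]+)[*_]{1,3}", r"\1", text)   # bold/italic markers
    return re.sub(r"\n{3,}", "\n\n", text).strip()

# Hypothetical page and record -- the response would still need to be
# written by hand or generated separately.
page = "# Submitting jobs\n\nUse `condor_submit` as described in [the docs](https://batchdocs.web.cern.ch/).\n"
record = {
    "prompt": "Summarize this page: " + markdown_to_text(page),
    "response": "",
    "source": "batchdocs",
}
print(json.dumps(record, ensure_ascii=False))
```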
@daleevans Thank you for your feedback! I will try your instructions soon! 🤗
I cannot find any 'yaml', 'finetune...', or 'config' folders or files. I'm using the Windows version. Is it possible to train with custom data on Windows?
Under the configs folder...
Is there any step-by-step documentation?
It looks easy; I'm just wondering how to use the downloaded model plus my additional dataset.