Comments (4)
@kolinfluence OK, we can also run inference offline.
Make sure you have the GGUF file llama-2-7b-chat.Q4_0.gguf and the model files for meta-llama/Llama-2-7b-chat-hf locally.
Please try this script: https://github.com/intel/neural-speed/blob/main/scripts/python_api_example_for_gguf.py
For example:
python scripts/python_api_example_for_gguf.py --model_name llama --model_path /your_model_path/meta-llama/Llama-2-7b-chat-hf -m /your_gguf_file_path/llama-2-7b-chat.Q4_0.gguf
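In essence, the script loads the HF tokenizer from --model_path and runs generation against the GGUF weights. A minimal sketch of that flow, assuming the neural_speed.Model API used by the repository's examples (the prompt and paths are placeholders; check the script itself for the exact signatures):

```python
# Illustrative sketch of what python_api_example_for_gguf.py does.
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_path = "/your_model_path/meta-llama/Llama-2-7b-chat-hf"  # HF dir (tokenizer/config)
gguf_path = "/your_gguf_file_path/llama-2-7b-chat.Q4_0.gguf"   # quantized weights

# Tokenize a prompt with the original HF tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# Load the GGUF weights and generate; "llama" matches --model_name llama.
model = Model()
model.init_from_bin("llama", gguf_path)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```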
OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.
This means you don't have access to the llama-2-7b-chat model on Hugging Face. You have to apply for access to the gated repo on the HF Hub first, then authenticate with an access token.
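Once access has been granted, you can authenticate either by running `huggingface-cli login` once, or by passing the token in code, for example (the token value is a placeholder):

```python
# Pass a Hugging Face access token explicitly; "hf_xxx" is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    token="hf_xxx",  # your personal token with access to the gated repo
)
```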
May I know what's the direction for this Neural Speed project? Are you going to keep improving it, or are you seeking to merge it into llama.cpp?
Neural Speed will not be merged into llama.cpp for now. Neural Speed aims to provide efficient LLM inference on Intel platforms. For example, it provides highly optimized low-precision kernels for CPUs, which lets it achieve better performance than llama.cpp. Please check https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176.
@kolinfluence Sorry about this.
I have checked your script; it's correct.
The cause may be a too-old ITREX version.
I can get the correct result using your script.
As you can see, the ITREX version here is 1.4.1.
Please reinstall ITREX and Neural Speed, then re-run the script:
pip install intel-extension-for-transformers; pip install neural_speed
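After reinstalling, a quick way to confirm which versions you actually have (assuming these PyPI distribution names):

```python
# Print the installed ITREX and Neural Speed versions.
from importlib.metadata import version

print("ITREX:", version("intel-extension-for-transformers"))
print("Neural Speed:", version("neural-speed"))
```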
Where do I put the downloaded llama-2-7b-chat.Q4_0.gguf file?
The script will download the file directly from the HF Hub and automatically place it into the local HF cache. By default the path looks like ~/.cache/huggingface/hub/.
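If you would rather keep the file somewhere of your own choosing, one option is to download it explicitly and pass the resulting path to the script's -m argument. A sketch, assuming the GGUF is hosted in an ungated repo such as TheBloke/Llama-2-7B-Chat-GGUF:

```python
# Download the GGUF file to a directory of your choice.
# repo_id/filename are assumptions; point them at whichever repo hosts your file.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_0.gguf",
    local_dir="/your_gguf_file_path",
)
print(gguf_path)  # pass this path to the script's -m argument
```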
@Zhenzhong1 I used the same script but I get this.
Is it possible for me to download it manually? I actually have too many things on my laptop and wish not to use Hugging Face access, etc.
So how do I manually download it and try?
P.S.: May I know what's the direction for this Neural Speed project? Are you going to keep improving it, or are you seeking to merge it into llama.cpp?
python run_model.py
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
response.raise_for_status()
File "/root/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/transformers/utils/hub.py", line 385, in cached_file
resolved_file = hf_hub_download(
^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
raise head_call_error
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
metadata = get_hf_file_metadata(
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
r = _request_wrapper(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
response = _request_wrapper(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
hf_raise_for_status(response)
File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
raise GatedRepoError(message, response) from e
huggingface_hub.utils._errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-6628b4c7-577559ad44f1431409bac9bc;f4fc8e13-0f16-4ad1-bb6c-e3af7b61a3a1)
Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json.
Access to model meta-llama/Llama-2-7b-chat-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to ask for access.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/src/neural-speed/run_model.py", line 12, in <module>
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 758, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 590, in get_tokenizer_config
resolved_config_file = cached_file(
^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/utils/hub.py", line 400, in cached_file
raise EnvironmentError(
OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.