
neural-speed's Introduction

Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by Intel Neural Compressor. The work is inspired by llama.cpp and further optimized for Intel platforms with our innovations presented at NeurIPS 2023.

Key Features

  • Highly optimized kernels on CPUs with ISAs (AMX, VNNI, AVX512F, AVX_VNNI and AVX2) for N-bit weight (int1, int2, int3, int4, int5, int6, int7 and int8). See details
  • Up to 40x performance speedup on popular LLMs compared with llama.cpp. See details
  • Tensor parallelism across sockets/nodes on CPUs. See details

Neural Speed is under active development so APIs are subject to change.

Supported Hardware

  • Intel Xeon Scalable Processors
  • Intel Xeon CPU Max Series
  • Intel Core Processors

Supported Models

Neural Speed supports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, Whisper, etc. File an issue if your favorite LLM does not work.

It also supports typical LLMs in GGUF format, such as Llama2, Falcon, MPT, Bloom, etc. More are coming; check out the details.

Installation

Install from binary

pip install -r requirements.txt
pip install neural-speed

Build from Source

pip install .

Note: GCC version 10 or newer is required.

Quick Start (Transformer-like usage)

Install Intel Extension for Transformers to use Transformer-like APIs.

PyTorch Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

GGUF Model from Hugging Face

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
gguf_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file=gguf_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

PyTorch Model from Modelscope

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

As an Inference Backend in Neural Chat Server

Neural Speed can be used as an inference backend in the Neural Chat Server of Intel Extension for Transformers. You can enable it by adding use_neural_speed: true to config.yaml.

  • Add an optimization section to use Neural Speed and its RTN quantization (example):
device: "cpu"

# itrex int4 llm runtime optimization
optimization:
    use_neural_speed: true
    optimization_type: "weight_only"
    compute_dtype: "fp32"
    weight_dtype: "int4"
  • Add the keys use_neural_speed and use_gptq to use Neural Speed and load a GPTQ model (example):
device: "cpu"
use_neural_speed: true
use_gptq: true

For more details, please refer to Neural Chat. A minimal client sketch follows below.
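
Once the server is running with a config like the ones above, you can send it chat requests over HTTP. The following is a minimal Python sketch, not taken from the Neural Chat docs: it assumes the server from the config (host 0.0.0.0, port 8000) exposes an OpenAI-style /v1/chat/completions route, so adjust the URL and payload to whatever your Neural Chat version actually serves.

import requests

# Assumed endpoint: host/port come from the config above; the route is a guess
# based on OpenAI-compatible servers and may differ in your Neural Chat version.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
}

response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()
print(response.json())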

Quick Start (llama.cpp-like usage)

Single (One-click) Step

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"

Multiple Steps

Convert and Quantize

# Skip this step if the GGUF model is from Hugging Face or was generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
# Using the quantize script requires a binary installation of Neural Speed
mkdir build && cd build
cmake .. && make -j
cd ..
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --build_dir ./build --weight_dtype int4 --alg sym

Inference

# Linux and WSL
OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"
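
If you prefer to drive these steps from Python rather than the scripts, the neural_speed.Model API (the same API that shows up in several issue reports later on this page) can quantize and run a model directly. Below is a minimal sketch under that assumption; the weight_dtype/compute_dtype arguments mirror the quantize script and may differ across versions.

from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"   # Hugging Face model_id or local model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Convert, quantize (int4 weights, int8 compute), and load in one call.
model = Model()
model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")

inputs = tokenizer("She opened the door and see", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))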

Please refer to Advanced Usage for more details.

Advanced Topics

New model enabling

You can add support for your own models; please follow the graph developer document.

Performance profiling

Set the NEURAL_SPEED_VERBOSE environment variable to enable performance profiling; a short Python sketch follows the mode list below.

Available modes:

  • 0: Print full information: evaluation time and operator profiling. Requires setting NS_PROFILING to ON and recompiling.
  • 1: Print evaluation time only, i.e. the time taken for each evaluation.
  • 2: Profile individual operators to identify performance bottlenecks within the model. Requires setting NS_PROFILING to ON and recompiling.
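
As a minimal sketch, the variable can also be set from Python before the runtime is loaded; this assumes the Transformer-like API from the Quick Start above, but the same applies to the scripts.

import os

# Must be set before the Neural Speed runtime is initialized;
# "1" prints the evaluation time of each step.
os.environ["NEURAL_SPEED_VERBOSE"] = "1"

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = model.generate(inputs, streamer=TextStreamer(tokenizer), max_new_tokens=32)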

neural-speed's People

Contributors

a32543254, aahouzi, airmeng, akarx23, ceciliawwq, changwangss, clarkchin08, ddele, eltociear, hshen14, intellinjun, kevinintel, liangyx2, luoyu-intel, lvliang-intel, park12sj, parvizmp, penghuicheng, pre-commit-ci[bot], rdower, thanatosshinji, ujjawal-k-panchal, vincyzhang, xiguiw, xinyuye-intel, yuchengliu1, zhentaoyu, zhenwei-intel, zhenzhong1, zhewang1-intc


neural-speed's Issues

Neural Speed compilation failing in ORT

OS:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 23.10
Release:        23.10
Codename:       mantic

GCC Version

$ gcc --version
gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

CMake Version

$ cmake --version
cmake version 3.27.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Onnxruntime Tag

commit 8f5c79cb63f09ef1302e85081093a3fe4da1bc7d (HEAD -> v1p17p1, tag: v1.17.1, origin/rel-1.17.1)
Author: Rachel Guo <[email protected]>
Date:   Fri Feb 23 16:10:36 2024 -0800

    Update 1.17.1 patch release version (#19622)

    ### Description
    <!-- Describe your changes. -->

    Need to update patch release version.


    ### Motivation and Context
    <!-- - Why is this change required? What problem does it solve?
    - If it fixes an open issue, please link to the issue here. -->

    ---------

    Co-authored-by: rachguo <[email protected]>
    Co-authored-by: rachguo <[email protected]>

Command:

./build.sh --config RelWithDebInfo --parallel  --build_shared_lib --skip_tests

Error:

/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h: In instantiation of ‘class bestla::parallel::gemm::SchedulerBase<bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2> >’:
/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h:476:7:   required from ‘class bestla::parallel::gemm::SchedulerKBlockS<bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2> >’
/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h:657:14:   required from ‘void bestla::parallel::GemmRun(Launch_T&, const typename Launch_T::Param&, IThreading*) [with Parallel_T = gemm::SchedulerKBlockS<bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2> >; Launch_T = bestla::wrapper::gemm::LauncherIntKBlock<BTLA_ISA::AVX_VNNI, bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2>, bestla::prologue_a::gemm::ActivationF32KBlockQuantize, bestla::prologue_b::gemm::WeightKBlockNInteger, bestla::epilogue::gemm::AccumulatorWriteBackFp32>; typename Launch_T::Param = bestla::wrapper::gemm::LauncherIntKBlock<BTLA_ISA::AVX_VNNI, bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2>, bestla::prologue_a::gemm::ActivationF32KBlockQuantize, bestla::prologue_b::gemm::WeightKBlockNInteger, bestla::epilogue::gemm::AccumulatorWriteBackFp32>::Param]’
/onnxruntime/onnxruntime/contrib_ops/cpu/quantization/neural_speed_gemm.cc:98:30:   required from ‘void bestla::NSSQ4GemmCompInt8(size_t, size_t, size_t, const float*, size_t, storage::gemm::StorageWeightKBlockNInteger*, float*, size_t, int8_t*, parallel::IThreading*) [with GemmCore_T = gemm::ICoreRowNAvxvnniKBlock<24, 2>; size_t = long unsigned int; int8_t = signed char]’
/onnxruntime/onnxruntime/contrib_ops/cpu/quantization/neural_speed_gemm.cc:183:64:   required from here
/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h:49:16: error: ‘virtual void bestla::parallel::Scheduler2D::getIndex(ThreadProblem&) const’ was hidden [-Werror=overloaded-virtual=]
   49 |   virtual void getIndex(ThreadProblem& problem) const {
      |                ^~~~~~~~
/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h:142:16: note:   by ‘void bestla::parallel::gemm::SchedulerBase<_GemmCore_T>::getIndex(ThreadProblem&) [with _GemmCore_T = bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2>; ThreadProblem = bestla::parallel::gemm::ThreadProblemBase]’
  142 |   virtual void getIndex(ThreadProblem& problem) {
      |                ^~~~~~~~
/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h:66:16: error: ‘virtual void bestla::parallel::Scheduler2D::update(const bestla::parallel::Config2D&)’ was hidden [-Werror=overloaded-virtual=]
   66 |   virtual void update(const Config2D& config) {
      |                ^~~~~~
/onnxruntime/build/Linux/RelWithDebInfo/_deps/neural_speed-src/bestla/bestla_parallel.h:151:16: note:   by ‘void bestla::parallel::gemm::SchedulerBase<_GemmCore_T>::update(const bestla::parallel::gemm::Config&) [with _GemmCore_T = bestla::gemm::ICoreRowNAvxvnniKBlock<24, 2>]’
  151 |   virtual void update(const Config& config) {
      |                ^~~~~~
[ 78%] Built target onnx_test_runner_common
[ 78%] Built target onnxruntime_session
[ 78%] Linking CXX static library libonnxruntime_framework.a
[ 78%] Linking CXX shared module libtest_execution_provider.so
[ 78%] Built target test_execution_provider
[ 78%] Built target onnxruntime_framework
[ 78%] Linking CXX static library libonnxruntime_graph.a
[ 78%] Linking CXX static library libonnxruntime_util.a
[ 78%] Built target onnxruntime_graph
[ 78%] Built target onnxruntime_util
cc1plus: all warnings being treated as errors
gmake[2]: *** [CMakeFiles/onnxruntime_providers.dir/build.make:2498: CMakeFiles/onnxruntime_providers.dir/onnxruntime/onnxruntime/contrib_ops/cpu/quantization/neural_speed_gemm.cc.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:2010: CMakeFiles/onnxruntime_providers.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
  File "/onnxruntime/tools/ci_build/build.py", line 2887, in <module>
    sys.exit(main())
             ^^^^^^
  File "/onnxruntime/tools/ci_build/build.py", line 2779, in main
    build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
  File "/onnxruntime/tools/ci_build/build.py", line 1659, in build_targets
    run_subprocess(cmd_args, env=env)
  File "/onnxruntime/tools/ci_build/build.py", line 839, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/onnxruntime/tools/python/util/run.py", line 49, in run
    completed_process = subprocess.run(
                        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/cmake', '--build', '/onnxruntime/build/Linux/RelWithDebInfo', '--config', 'RelWithDebInfo', '--', '-j344']' returned non-zero exit status 2.

AssertionError: Fail to convert pytorch model

This happens using the example code only:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
print(outputs)

yields

2024-03-27 02:12:43 [INFO] Using Neural Speed...
2024-03-27 02:12:43 [INFO] cpu device is used.
2024-03-27 02:12:43 [INFO] Applying Weight Only Quantization.
2024-03-27 02:12:43 [INFO] Using LLM runtime.
cmd: ['python', PosixPath('/usr/local/lib/python3.10/dist-packages/neural_speed/convert/convert_mistral.py'), '--outfile', 'runtime_outs/ne_mistral_f32.bin', '--outtype', 'f32', 'Intel/neural-chat-7b-v3-1']
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
[<ipython-input-17-40dcb74a8701>](https://localhost:8080/#) in <cell line: 10>()
      8 streamer = TextStreamer(tokenizer)
      9 
---> 10 model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
     11 outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
     12 print(outputs)

1 frames
[/usr/local/lib/python3.10/dist-packages/neural_speed/__init__.py](https://localhost:8080/#) in init(self, model_name, use_quant, use_gptq, use_awq, use_autoround, weight_dtype, alg, group_size, scale_dtype, compute_dtype, use_ggml)
    129         if not os.path.exists(fp32_bin):
    130             convert_model(model_name, fp32_bin, "f32")
--> 131             assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
    132 
    133         if not use_quant:

AssertionError: Fail to convert pytorch model


source build from release tar file?

I'd really like to build from source using one of the release tar files, but I'm having issues. Does anybody know of any magic to make it possible? If you can't build from the release files, what purpose do they serve?

(nspeed) [host1:neural-speed-1.0] pip install -r requirements.txt
  .
  .
  .
(nspeed) [host1:neural-speed-1.0] pip install .
Processing /scr/nspeed/neural-speed-1.0
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [46 lines of output]
      /opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/__init__.py:81: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
      !!
      
              ********************************************************************************
              Requirements should be satisfied by a PEP 517 installer.
              If you are using pip, you can try `pip install --use-pep517`.
              ********************************************************************************
      
      !!
        dist.fetch_build_eggs(dist.setup_requires)
      WARNING setuptools_scm.pyproject_reading toml section missing 'pyproject.toml does not contain a tool.setuptools_scm section'
      Traceback (most recent call last):
        File "/scr/nspeed/neural-speed-1.0/.eggs/setuptools_scm-8.1.0-py3.11.egg/setuptools_scm/_integration/pyproject_reading.py", line 36, in read_pyproject
          section = defn.get("tool", {})[tool_name]
                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
      KeyError: 'setuptools_scm'
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/scr/nspeed/neural-speed-1.0/setup.py", line 232, in <module>
          setup(
        File "/opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/__init__.py", line 104, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 146, in setup
          _setup_distribution = dist = klass(attrs)
                                       ^^^^^^^^^^^^
        File "/opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/dist.py", line 307, in __init__
          _Distribution.__init__(self, dist_attrs)
        File "/opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 284, in __init__
          self.finalize_options()
        File "/opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/dist.py", line 658, in finalize_options
          ep(self)
        File "/opt/pytorch-2.3.0/envs/nspeed/lib/python3.11/site-packages/setuptools/dist.py", line 678, in _finalize_setup_keywords
          ep.load()(self, ep.name, value)
        File "/scr/nspeed/neural-speed-1.0/.eggs/setuptools_scm-8.1.0-py3.11.egg/setuptools_scm/_integration/setuptools.py", line 103, in version_keyword
          _assign_version(dist, config)
        File "/scr/nspeed/neural-speed-1.0/.eggs/setuptools_scm-8.1.0-py3.11.egg/setuptools_scm/_integration/setuptools.py", line 58, in _assign_version
          _version_missing(config)
        File "/scr/nspeed/neural-speed-1.0/.eggs/setuptools_scm-8.1.0-py3.11.egg/setuptools_scm/_get_version_impl.py", line 117, in _version_missing
          raise LookupError(
      LookupError: setuptools-scm was unable to detect version for /scr/nspeed/neural-speed-1.0.
      
      Make sure you're either building from a fully intact git repository or PyPI tarballs. Most other sources (such as GitHub's tarballs, a git checkout without the .git folder) don't contain the necessary metadata and will not work.
      
      For example, if you're using pip, instead of https://github.com/user/proj/archive/master.zip use git+https://github.com/user/proj.git#egg=proj
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I did try the PEP 517 suggestion, but then it started complaining about "ModuleNotFoundError: No module named 'cmake'", which I'm sure is there. I think the stuff at the end about git repositories is probably more relevant. I could not figure out how to make the "git+https://github.com/user/proj.git#egg=proj" suggestion work.

error running inference

Hi,
I was running
NEURAL_SPEED_VERBOSE=1 python scripts/run.py /home/rachels.dell/confluence-search/intel_neural_chat --weight_dtype int4 -p "She opened the door and see"
and this ran as expected. Then I ran
OMP_NUM_THREADS=128 numactl -m 0 -C 0-100 python scripts/inference.py --model_name mistral -m mistral_files/ne_mistral_int4.bin -c 512 -b 1024 -n 256 -t 128 --color -p "She opened the door and see"

and I get

Namespace(model_name='mistral', model=PosixPath('mistral_files/ne_mistral_int4.bin'), build_dir=PosixPath('/home/rachels.dell/neural-speed/scripts/../build'), prompt='She opened the door and see', tokenizer='THUDM/chatglm-6b', n_predict=256, threads=128, batch_size_truncate=1024, ctx_size=512, seed=-1, repeat_penalty=1.1, color=True, keep=0, shift_roped_k=False, memory_f32=False, memory_f16=False, memory_auto=False, one_click_run='False')
Please build graph first or select the correct model name.

Any idea why?

Add support for phi3-vision

Please add support for phi3-vision to Neural Speed. According to the published benchmarks, it comes close to many SOTA multimodal models at only a fraction of their size, which makes it perfect for running on client CPUs and Intel GPUs.

Error at Colab Inference of neural-chat-7b-v3-1 Model

Hey!
Thanks for the great project and for sharing it with the community.

I am trying to run inference with the HF neural-chat model.

What I tried

In Colab,

!pip install intel-extension-for-transformers intel-extension-for-pytorch accelerate datasets neural-speed -q

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Behaviour

I got an error and below is the full trace.

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
2024-01-24 07:20:44 [INFO] cpu device is used.
2024-01-24 07:20:44 [INFO] Applying Weight Only Quantization.
2024-01-24 07:20:44 [INFO] Using LLM runtime.

cmd: ['python', PosixPath('/usr/local/lib/python3.10/dist-packages/neural_speed/convert/convert_mistral.py'), '--outfile', 'runtime_outs/ne_mistral_f32.bin', '--outtype', 'f32', 'Intel/neural-chat-7b-v3-1']

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

[<ipython-input-2-c1da0f81c837>](https://localhost:8080/#) in <cell line: 10>()
      8 streamer = TextStreamer(tokenizer)
      9 
---> 10 model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
     11 outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

1 frames

[/usr/local/lib/python3.10/dist-packages/neural_speed/__init__.py](https://localhost:8080/#) in init(self, model_name, use_quant, use_cache, use_gptq, use_awq, weight_dtype, alg, group_size, scale_dtype, compute_dtype, use_ggml)
    115         if not use_cache or not os.path.exists(fp32_bin):
    116             convert_model(model_name, fp32_bin, "f32")
--> 117             assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
    118 
    119         if not use_quant:

AssertionError: Fail to convert pytorch model

What I ask

  1. Why do I get this error while trying the official example?
  2. How can I handle this error?

Thanks.

Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator

I've discovered a performance gap between the Neural Speed Matmul operator and the llama.cpp operator in the Neural-Speed repository. This issue was identified while running a benchmark with the ONNXRuntime-GenAI tool.
The ONNXRuntime-GenAI tool was used to run a CPU-based int4 version of Phi-2 that uses the MatMulNBits operator. Its performance was then compared with the metrics from the llama.cpp operator.
The GenAI token generation throughput was measured at roughly 13.70 tokens per second (tps), while the llama.cpp operator achieved a higher throughput of 22.75 tps. These metrics are end-to-end.
Upon profiling, the MatMulNBits operator was identified as the bottleneck of the model. Performance metrics acquired with the onnxruntime profiling tool are included below for further analysis.

Past sequence length 29, Total sequence length 30

Name               | Duration | Pct   | Count | Cumulative Pct | Cumulative Dur
MatMulNBits        | 4132089  | 87.39 | 15826 | 87.39          | 4132089
MultiHeadAttention | 289191   | 6.12  | 2624  | 93.50          | 4421280
Add                | 131205   | 2.77  | 16072 | 96.28          | 4552485
FastGelu           | 67065    | 1.42  | 2624  | 97.69          | 4619550

Past sequence length 128, Total sequence length 129

Name               | Duration | Pct   | Count | Cumulative Pct | Cumulative Dur
MatMulNBits        | 3882211  | 81.92 | 15440 | 81.92          | 3882211
MultiHeadAttention | 576563   | 12.17 | 2560  | 94.08          | 4458774
Add                | 118635   | 2.50  | 15680 | 96.59          | 4577409
FastGelu           | 60107    | 1.27  | 2560  | 97.86          | 4637516

Past sequence length 512, Total sequence length 513

Name               | Duration | Pct   | Count | Cumulative Pct | Cumulative Dur
MatMulNBits        | 3054838  | 62.79 | 11773 | 62.79          | 3054838
MultiHeadAttention | 1582324  | 32.53 | 1952  | 95.32          | 4637162
Add                | 98730    | 2.03  | 11956 | 97.35          | 4735892
FastGelu           | 48359    | 0.99  | 1952  | 98.34          | 4784251

This issue needs to be addressed to improve the performance of the Neural Speed Matmul operator and bring it up to par with the Llama.cpp operator.

BF16 Compute DType on AVX512 ISA

In the Bestla README.md Weight only Quantization supported config is provided - https://github.com/intel/neural-speed/blob/main/bestla/README.md#weight-only

As Bestla supports the BF16 compute DType, I have quantized the model using quantize.py - https://github.com/intel/neural-speed/blob/main/scripts/quantize.py

Ex: python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype bf16

During the inference cycle, I noticed that for both the FP32 and BF16 compute types, F32 APIs are being triggered.
One scenario is within QKV fusion, where BTLAGemmCompF32() is triggered for both F32 and BF16:

void BTLAGemmCompF32(const int M, const int N, const int K, const float* A, const int lda,

Question 1: I would like to know if I can use the Bestla/Neural Speed APIs with the BF16 compute dtype without falling back to F32 on the AVX512 ISA, and which input dtypes the API supports.

Distributing tensors across NUMA nodes

I'm wondering how much support Neural Speed has for NUMA systems. The Advanced Usage page suggests that all tensors should be allocated on the first NUMA node numactl -m 0 -C 0-<physic_cores-1>. Is there any benefit to doing this?

Is Qwen supported?

Hello, I noticed that the LLM runtime in ITREX seems to have been migrated to this project. I tried to use model.init() to load the Qwen HF model, but Python throws the error TypeError: Unspported model type qwen!
In my impression, the ITREX project supported the use of the Qwen model, but it seems that after the update the Qwen model can no longer be used. Is there any problem with the Qwen model? Will the Qwen model be supported again in the future?

Bestla Kernels understanding and benchmarking

In oneDNN with low-precision datatypes, we have support for the u8s8s8 datatype. In the Bestla benchmark infra we can find a couple of classes for low-precision types, including u8s8s32, s8s8s32, and some classes with different clip dtypes. Ref: https://github.com/intel/neural-speed/blob/main/bestla/bestla/ut/bestla_benchmark.cpp

Question: Within Bestla, do we have support only for s32 output (i.e. u8s8s32/s8s8s32), or do we also have support for s8 output (i.e. u8s8s8/s8s8s8)?


For the Bestla benchmark, we have instructions here to build and benchmark with the Bestla kernels (Ref: https://github.com/intel/neural-speed/tree/main/bestla#benchmark).

Question: Are there any specific environment variables that need to be set to get the best performance out of the Bestla kernels?

Issue in whisper inference from pre-converted gguf

I get the following error when running whisper from an already converted (to GGUF) whisper model (i.e., using init_from_bin instead of init):

Traceback (most recent call last):
  File "/root/ne/scripts/whisper_example.py", line 78, in <module>
    main()
  File "/root/ne/scripts/whisper_example.py", line 72, in main
    model.init_from_bin(args.model_name, gguf_path)
  File "/usr/local/lib/python3.11/dist-packages/neural_speed/__init__.py", line 297, in init_from_bin
    self.model.init_model(model_path, **_filter_model_args(valid_args, **generate_kwargs))
TypeError: init_model(): incompatible function arguments. The following argument types are supported:
    1. (self: neural_speed.whisper_cpp.Model, model_path: str) -> None

Invoked with: <neural_speed.whisper_cpp.Model object at 0x7f0c33157d70>, '/root/ne/scripts/ne-q40.bin'; kwargs: threads=4

Please suggest a solution

Error loading model when use qwen gguf model

It's a Qwen base model downloaded from HF. It can run inference with llama.cpp (latest version), but it can't run on the latest version of neural-speed.
run_qwen shows the error:

Loading the bin file with GGUF format...
error loading model: unrecognized tensor type 13

Loading checkpoint shards takes too long

When I load "meta-llama/Meta-Llama-3-8B-Instruct" model like this

from transformers import AutoTokenizer, TextStreamer from intel_extension_for_transformers.transformers import AutoModelForCausalLM model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # Hugging Face model_id or local model tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) streamer = TextStreamer(tokenizer) model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

it got hanged. Then only way is to restart instance to recover it.

Is there any issue in my spec?

my instance spec ubunu 32 GB RAM.

Can't inference Llama2 through GGUF

I've downloaded TheBloke/Llama-2-7B-Chat-GGUF from Hugging Face and used git lfs pull to download all the GGUFs.

Since my access to Meta/Llama 2 has not been approved yet, I chose KoboldAI/llama2-tokenizer as the tokenizer. I'm running the code from the readme; here is the error message I get:

Traceback (most recent call last):
  File "/neural-speed/llama.py", line 16, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, model_file = model_file)
  File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 165, in from_pretrained
    model.init_from_bin(model_type, gguf_model_file)
  File "/neural-speed/neural_speed/__init__.py", line 132, in init_from_bin
    self.__import_package(model_type)
  File "/neural-speed/neural_speed/__init__.py", line 46, in __import_package
    import neural_speed.llama_cpp as cpp_model
ModuleNotFoundError: No module named 'neural_speed.llama_cpp'

Why is this happening?

Running Q4_K_M gguf models: unrecognized tensor type 12

Welcome to use the llama on the ITREX! 
AVX:1 AVX2:1 AVX512F:0 AVX_VNNI:1 AVX512_VNNI:0 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
Loading the bin file with GGUF format...
main: seed  = 1712361979
model.cpp: loading model from /models/llama-2-7b.Q4_K_S.gguf
error loading model: unrecognized tensor type 12

model_init_from_file: failed to load model

I got this error when trying to load the Q4_K_M and Q4_K_S quantized models for Llama-2-7B-GGUF. I would appreciate it if support could be added.

Is tensor parallelism supported by neural speed?

An example of TP is provided in the Neural Speed documentation:
mpirun -np 2 -bind-to=socket ./build/bin/main_gptj -m ne-q4_0.bin --seed 1234 -t 56 -c 68 -n 32 -p "Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun." --no_mmap

but I didn't find main_gptj in the build/bin path, and I also didn't find the no_mmap option.

Does Neural Speed still support the TP feature?

Performance on Xeon Scalable

Hello everyone, we are seeing slower than expected inference times on one of our CPU nodes with an Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz and the following instruction sets:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg rdpid fsrm md_clear flush_l1d arch_capabilities

We use the latest version of neuralchat_server and neural-speed in combination with intel-extension-for-transformers, with the following config:

host: "0.0.0.0"
port: 8000
model_name_or_path: "/root/Intel/neural-chat-7b-v3-3"
device: cpu
tasks_list: ["textchat"]

optimization:
  use_neural_speed: true
  optimization_type: weight_only
  compute_dtype: fp32
  weight_dtype: int8

We are seeing extremely slow time to first token with example prompts like "Tell me about Intel Xeon Scalable Processors."

With the following measured times:

Weight Precision | Max Tokens | Response Time
Int8             | unset      | 73s
Int8             | 128        | 69s
Int4             | unset      | 73s
Int4             | 128        | 65s

Without neural-speed compression of said model, we got inference times of only around 20s.

Is there any misconfiguration on our part?

I would love to hear your feedback and appreciate any help.

baseline example not working

This is with

!pip install neural-speed
!pip install intel-extension-for-transformers

Yes, accelerate is not installed, but the baseline example should work "out of the box".


Collecting intel-extension-for-transformers
  Downloading intel_extension_for_transformers-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.2/44.2 MB 13.4 MB/s eta 0:00:00
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from intel-extension-for-transformers) (24.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from intel-extension-for-transformers) (1.25.2)
Collecting schema (from intel-extension-for-transformers)
  Downloading schema-0.7.5-py2.py3-none-any.whl (17 kB)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from intel-extension-for-transformers) (6.0.1)
Collecting neural-compressor (from intel-extension-for-transformers)
  Downloading neural_compressor-2.5-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 41.1 MB/s eta 0:00:00
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from intel-extension-for-transformers) (4.38.2)
Collecting deprecated>=1.2.13 (from neural-compressor->intel-extension-for-transformers)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Requirement already satisfied: opencv-python-headless in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (4.9.0.80)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (1.5.3)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (9.4.0)
Requirement already satisfied: prettytable in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (3.10.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (5.9.5)
Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (9.0.0)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (2.31.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (1.2.2)
Requirement already satisfied: pycocotools in /usr/local/lib/python3.10/dist-packages (from neural-compressor->intel-extension-for-transformers) (2.0.7)
Requirement already satisfied: contextlib2>=0.5.5 in /usr/local/lib/python3.10/dist-packages (from schema->intel-extension-for-transformers) (21.6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers->intel-extension-for-transformers) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-packages (from transformers->intel-extension-for-transformers) (0.20.3)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers->intel-extension-for-transformers) (2023.12.25)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers->intel-extension-for-transformers) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers->intel-extension-for-transformers) (0.4.2)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers->intel-extension-for-transformers) (4.66.2)
Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.10/dist-packages (from deprecated>=1.2.13->neural-compressor->intel-extension-for-transformers) (1.14.1)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers->intel-extension-for-transformers) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers->intel-extension-for-transformers) (4.10.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->neural-compressor->intel-extension-for-transformers) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->neural-compressor->intel-extension-for-transformers) (2023.4)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from prettytable->neural-compressor->intel-extension-for-transformers) (0.2.13)
Requirement already satisfied: matplotlib>=2.1.0 in /usr/local/lib/python3.10/dist-packages (from pycocotools->neural-compressor->intel-extension-for-transformers) (3.7.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->neural-compressor->intel-extension-for-transformers) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->neural-compressor->intel-extension-for-transformers) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->neural-compressor->intel-extension-for-transformers) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->neural-compressor->intel-extension-for-transformers) (2024.2.2)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->neural-compressor->intel-extension-for-transformers) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->neural-compressor->intel-extension-for-transformers) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->neural-compressor->intel-extension-for-transformers) (3.3.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools->neural-compressor->intel-extension-for-transformers) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools->neural-compressor->intel-extension-for-transformers) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools->neural-compressor->intel-extension-for-transformers) (4.50.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools->neural-compressor->intel-extension-for-transformers) (1.4.5)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools->neural-compressor->intel-extension-for-transformers) (3.1.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->neural-compressor->intel-extension-for-transformers) (1.16.0)
Installing collected packages: schema, deprecated, neural-compressor, intel-extension-for-transformers
Successfully installed deprecated-1.2.14 intel-extension-for-transformers-1.3.2 neural-compressor-2.5 schema-0.7.5

I wish for a simpler way to run the model

I'm not well versed in Python. Where do I put the downloaded llama-2-7b-chat.Q4_0.gguf file?

I can make llama.cpp work really easily on my laptop, but I can't seem to get this to work.

I did git clone neural-speed, did the pip install ..., saved the file as run_model.py, and ran:

python run_model.py

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file = model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
(base) root@ubuntu:/usr/local/src/neural-speed# python run_model.py 
Traceback (most recent call last):
  File "/usr/local/src/neural-speed/run_model.py", line 2, in <module>
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
ImportError: cannot import name 'WeightOnlyQuantConfig' from 'intel_extension_for_transformers.transformers' (/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_transformers/transformers/__init__.py)
(base) root@ubuntu:/usr/local/src/neural-speed# 

Qwen2 GPTQ break in cpp_model.Model.np_bestla_qpack

Hi,
The model I used is Qwen1.5-0.5B-Chat-GPTQ-int4 from huggingface.
After debugging, it seems the model cannot be converted correctly by:
cpp_model.Model.np_bestla_qpack(

It breaks here without any error or message shown.
The program can still continue to run, though, and eventually it shows an error when trying generation for the first time:
error loading model: model.cpp: tensor 'model.layers.0.self_attn.q_proj.weight' is missing from model

AVX_VNNI Numeric Bug?

$ git log
commit 9e20bd1072bb927613d55779b09752b05a348a9b (HEAD -> main, origin/main, origin/HEAD)
Author: luoyu-intel <[email protected]>
Date:   Fri Jan 5 10:44:12 2024 +0800

    make UT OFF as default. (#25)

    * make UT OFF as default.

    * change pointer to const void*

Generate weights:

$ python scripts/convert.py --outtype f32 --outfile llama2.ne-f32.bin ~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93
$ python scripts/quantize.py --model_name llama2 --model_file llama2.ne-f32.bin --out_file llama2.ne.weight-int4.group-128.compute-int8.bin --nthread 56 --weight_dtype int4 --group_size 128 --compute_dtype int8

If I run natively on SPR:

./build/bin/run_llama -s 0 --model_name llama -m llama2.ne.weight-int4.group-128.compute-int8.bin -c 512 -b 1024 -n 4 -t 1 -p "Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun."
...
Welcome to use the llama on the ITREX!
main: seed  = 0
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from llama2.ne.weight-int4.group-128.compute-int8.bin
init: n_vocab    = 32000
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ne ctx size = 3536.38 MB
load: mem required  = 5586.38 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  276.00 MB

system_info: n_threads = 1 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1024, n_predict = 4, n_keep = 0


 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. One day, she

However, I'm only interested in AVX2/AVX_VNNI.
So instead I just disable AVX512 to exercise the AVX2 + AVX_VNNI paths:

218   inline bool AVX() { return mHasAVX; }
219   inline bool AVX2() { return mHasAVX2; }
220   inline bool AVX_VNNI() { return mHasAVX_VNNI; }
221   inline bool AVX512F() { return false && mHasAVX512F; }
222   inline bool AVX512_VNNI() { return false && mHasAVX512_VNNI; }
223   inline bool AMX_INT8() { return false && mHasAMX_INT8; }
224   inline bool AMX_BF16() { return false && mHasAMX_BF16; }
225   inline bool AVX512_BF16() { return false && mHasAVX512_BF16; }
226   inline bool AVX512_FP16() { return false && mHasAVX512_FP16; }

Now I see:

Welcome to use the llama on the ITREX!
main: seed  = 0
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from llama2.ne.weight-int4.group-128.compute-int8.bin
init: n_vocab    = 32000
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ne ctx size = 3536.38 MB
load: mem required  = 5586.38 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 0
model_init_from_file: kv self size =  128.00 MB

system_info: n_threads = 1 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1024, n_predict = 4, n_keep = 0


 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun.tembretembretembre

Note the bogus generated tokens tembretembretembre vs One day, she.

Am I missing something or is there a numeric issue with the AVX2/AVX_VNNI path?

FWIW I didn't see this behavior with a previous version: https://github.com/intel/intel-extension-for-transformers/tree/c087c74da00711fcac37014cc8aea443c4b5fa82/intel_extension_for_transformers/llm/runtime/graph

We could use SDE for quickly comparing generated tokens for different ISAs; however, if I run under SDE with a spoofed CPU ID for Meteor Lake (i.e. sde -mtl ...), we see a segfault when it's attempting to determine the hybrid config:

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007fffb25c85aa in bestla::device::CpuDevice::CpuDevice (this=0x7fffb293d940 <bestla::device::CpuDevice::getInstance()::instance>) at /home/parvizmp/neural-speed/bestla/bestla/bestla_device.h:306
306             E_L1Cache = L1[E_core[0]];

Garbled characters with beam search

from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Model()
model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")

tokens = tokenizer("What's your favorite animal?", return_tensors='pt').input_ids

outputs = model.generate(tokens, num_beams=2, do_sample=False, max_new_tokens=10)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

With the above code, I got the garbled characters below:
"What's your favorite animal? ���������"

If I generate without beam search, I get the expected result:
outputs = model.generate(tokens)
"What's your favorite animal?
everybody has a favorite animal, and it's a"

Huge performance difference in "Transformer-like" usage and "llama.cpp-like" usage

Llama.cpp-like usage by running the scripts is really fast, but when I use the ITREX library (which makes use of the Transformer-like usage) the performance difference is huge. Here is the time taken by each approach:

  • Transformer-like usage: >10 mins
  • Llama.cpp-like usage: ~2 mins

Here's how I am using them:
Transformer Usage:

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300) 

Llama.cpp usage:

python neural-speed/scripts/inference.py --model_name mistral -m runtime_outs/ne_mistral_q_nf4_bestla_cfp32_g32.bin -c 512 -b 1024 -n 256 --color -p "She opened the door and see"

Is there something that I am missing?

Build failure when building the executable

Current Behavior:

  • Building the executable results in a failure; all steps are successful except the last step.


Steps To Reproduce:

# Linux and WSL
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja

Environment:

  • OS: Rocky Linux 8.8
  • HW: Intel(R) Xeon(R) Platinum 8470Q

Can't Load Qwen after support qwen2

After updating the support for Qwen2, it seems that the Qwen (Qwen1) model can no longer be loaded normally:

model_quantize_internal: model size  = 54043.93 MB
model_quantize_internal: quant size  =  7217.12 MB
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900000, continuous_batching: 0, max_request_num: 8, early_stopping: 0
model.cpp: loading model from runtime_outs/ne_qwen_q_int4_bestla_cint8_g128.bin
Loading the bin file with NE format...
read_ne_hparams  0.hparams.n_vocab = 152064                        
read_ne_hparams  1.hparams.n_embd = 5120                          
read_ne_hparams  2.hparams.n_mult = 27392                         
read_ne_hparams  3.hparams.n_head = 40                            
read_ne_hparams  4.hparams.n_head_kv = 0                             
read_ne_hparams  5.hparams.n_layer = 40                            
read_ne_hparams  6.hparams.n_rot = 152064                        
read_ne_hparams  7.hparams.ftype = 0                             
read_ne_hparams  8.hparams.max_seq_len = 2048                          
read_ne_hparams  9.hparams.alibi_bias_max = 0.000000                      
read_ne_hparams  10.hparams.clip_qkv = 0.000000                      
read_ne_hparams  11.hparams.par_res = 0                             
read_ne_hparams  12.hparams.word_embed_proj_dim = 0                             
read_ne_hparams  13.hparams.do_layer_norm_before = 0                             
read_ne_hparams  14.hparams.multi_query_group_num = 0                             
read_ne_hparams  15.hparams.ffn_hidden_size = 27392                         
read_ne_hparams  16.hparams.inner_hidden_size = 0                             
read_ne_hparams  17.hparams.n_experts = 0                             
read_ne_hparams  18.hparams.n_experts_used = 0                             
read_ne_hparams  19.hparams.inner_hidden_size = 0                             
read_ne_hparams  20.hparams.freq_base = 10000.000000                  
read_ne_hparams  21.hparams.freq_scale = 1.000000                      
read_ne_hparams  22.hparams.rope_scaling_factor = 0.000000                      
read_ne_hparams  23.hparams.original_max_position_embeddings = 0                             
read_ne_hparams  24.hparams.use_yarn = 0                             
read_ne_vocab    25.vocab.bos_token_id = 151643                        
read_ne_vocab    26.vocab.eos_token_id = 151643                        
read_ne_vocab    27.vocab.pad_token_id = 151643                        
read_ne_vocab    28.vocab.sep_token_id = -1                            
init: n_vocab    = 152064
init: n_embd     = 5120
init: n_mult     = 27392
init: n_head     = 40
init: n_head_kv  = 0
init: n_layer    = 40
init: n_rot      = 128
init: ftype      = 0
init: max_seq_len= 2048
init: n_ff       = 27392
init: n_parts    = 1
load: ne ctx size = 7217.30 MB
error loading model: model.cpp: tensor 'transformer.h.0.mlp.w1.weight' has wrong shape; expected  5120 x 27392, got  5120 x 13696
model_init_from_file: failed to load model
Segmentation fault (core dumped)

Linking back to Neural Chat / intel-extension-for-transformers

Hi y'all, I love these optimizations for inference with Intel CPUs! I'm really excited to try them out as an alternative to llama.cpp that is more highly optimized for Intel CPUs.

A small suggestion: you may want to link back to the Neural Chat server somewhere in this repo so that people know they can use neural-speed via Neural Chat. Specifically, the config here was not super straightforward to find if you started by looking at the docs here, which lead to the main neural-speed README. I almost fell down a rabbit hole of trying to integrate neural-speed with some other server framework based on the examples in the neural-speed repo. 😅 No pressure, just a suggestion to save some pain for others.

Yi-6B model failed to evaluate

Hi there,

I have been working on fine-tuning the Yi-6B model. I would like to deploy the model using AMX with Neural Speed. Unfortunately, there is an evaluation error when I use the generate command. I am not sure if this is a mistake on my side, but since this model is not on the supported list, I am opening this issue.

The script I used

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "hon9kon9ize/CantoneseLLMChat-v0.5"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
streamer = TextStreamer(tokenizer)

def chat(messages, temperature=0.9, max_new_tokens=200):
    input_ids = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, streamer=streamer, temperature=temperature, do_sample=True)
    
    chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    print(chatml)

    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=False)

    return response

messages = [{"role": "user", "content": "邊個係香港特首?"}]


chat(messages)

Error message

2024-07-16 21:27:15 [INFO] cpu device is used.
2024-07-16 21:27:15 [INFO] Applying Weight Only Quantization.
2024-07-16 21:27:15 [INFO] Quantize model by Neural Speed with RTN Algorithm.
model.cpp: loading model from runtime_outs/ne_llama_q_nf4_bestla_cfp32_g32.bin
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
Loading the bin file with NE format...
init: n_vocab    = 64960
init: n_ctx      = 0
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 4
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ctx size   = 3621.86 MB
load: scratch0   = 4096.00 MB
load: scratch1   = 2048.00 MB
load: scratch2   = 4096.00 MB
load: mem required  = 13861.86 MB (+ memory per state)
..............................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =   69.00 MB
runtime_outs/ne_llama_q_nf4_bestla_cfp32_g32.bin existed, will use cache file. Otherwise please remove the file
<|im_start|><|Human|> 
邊個係香港特首?<|im_end|> 
<|im_start|><|Assistant|> 
llama_model_eval_internal: first token must be BOS (token id is 1) in 0th prompt
model_eval: failed to eval

<|im_start|><|Human|>
邊個係香港特首?<|im_end|>
<|im_start|><|Assistant|>

load_ne_hparams  0.hparams.n_vocab = 64960                         
load_ne_hparams  1.hparams.n_embd = 4096                          
load_ne_hparams  2.hparams.n_mult = 256                           
load_ne_hparams  3.hparams.n_head = 32                            
load_ne_hparams  4.hparams.n_head_kv = 4                             
load_ne_hparams  5.hparams.n_layer = 32                            
load_ne_hparams  6.hparams.n_rot = 128                           
load_ne_hparams  7.hparams.ftype = 0                             
load_ne_hparams  8.hparams.max_seq_len = 0                             
load_ne_hparams  9.hparams.alibi_bias_max = 0.000                         
load_ne_hparams  10.hparams.clip_qkv = 0.000                         
load_ne_hparams  11.hparams.par_res = 0                             
load_ne_hparams  12.hparams.word_embed_proj_dim = 0                             
load_ne_hparams  13.hparams.do_layer_norm_before = 0                             
load_ne_hparams  14.hparams.multi_query_group_num = 0                             
load_ne_hparams  15.hparams.ffn_hidden_size = 11008                         
load_ne_hparams  16.hparams.inner_hidden_size = 0                             
load_ne_hparams  17.hparams.n_experts = 0                             
load_ne_hparams  18.hparams.n_experts_used = 0                             
load_ne_hparams  19.hparams.n_embd_head_k = 0                             
load_ne_hparams  20.hparams.norm_eps = 0.000001                      
load_ne_hparams  21.hparams.freq_base = 5000000.000                   
load_ne_hparams  22.hparams.freq_scale = 1.000                         
load_ne_hparams  23.hparams.rope_scaling_factor = 0.000                         
load_ne_hparams  24.hparams.original_max_position_embeddings = 0                             
load_ne_hparams  25.hparams.use_yarn = 0                             
load_ne_vocab    26.vocab.bos_token_id = 1                             
load_ne_vocab    27.vocab.eos_token_id = 2                             
load_ne_vocab    28.vocab.pad_token_id = 0                             
load_ne_vocab    29.vocab.sep_token_id = -1    
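
For what it's worth, a possible workaround (an assumption on my part, not a confirmed fix) is to make sure the prompt starts with the BOS token reported in the log above (id 1), since the runtime rejects prompts whose first token is not BOS. A minimal sketch against the script above:

import torch

# Sketch of a possible workaround (assumption, not a confirmed fix): prepend the BOS
# token so that the first token of the prompt is BOS, as model_eval expects.
bos_id = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else 1  # 1 per the log above
input_ids = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, return_tensors="pt")
if input_ids[0, 0].item() != bos_id:
    input_ids = torch.cat([torch.full((1, 1), bos_id, dtype=input_ids.dtype), input_ids], dim=1)
output_ids = model.generate(input_ids, max_new_tokens=200, streamer=streamer, temperature=0.9, do_sample=True)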

Context size of the model keeps falling back to the default of 512

Hi, I am running this:

export NEURAL_SPEED_VERBOSE=1
PROMPT1=$(cat <<'END_HEREDOC'
## some large text here about 1000 tokens
END_HEREDOC
)
python scripts/inference.py --model_name llama -m llama_files/ne_llama_int4.bin -c 1500  -n 400  --color -p "$PROMPT1"


model_print_timings:        load time =  1696.97 ms
model_print_timings:      sample time =   172.18 ms /   323 runs   (    0.53 ms per token)
model_print_timings: prompt eval time =  1696.79 ms /   512 tokens (    3.31 ms per token)
model_print_timings:        eval time = 21978.58 ms /   322 runs   (   68.26 ms per token)
model_print_timings:       total time = 23929.26 ms
========== eval time log of each prediction ==========

I keep getting output saying the prompt size is 512, but the actual prompt is more than 1000 tokens. Any advice on how to handle this? I pass -c 1500 to request a larger context size.

Also, when I use --keep 0 (which is the default) I still see the initial prompt printed to stdout. How do I stop it from appearing as part of the output?

What is the difference between ITREX and Neural Speed?

Hi, I have been reading the blog and the documentation here to try to understand the pros and cons of the Transformer-like usage vs. the llama.cpp-like usage, and how those compare to ITREX (Intel Extension for Transformers).

I am confused because the transformers-like usage of Neural Speed only calls ITREX APIs and does not call any APIs from this repo. How does the transformers-like usage actually invoke Neural Speed?

Is the llama.cpp-like usage mainly a front-end for ITREX? Should I expect to see any performance differences if I opt to use ITREX instead of the llama.cpp-like usage?

Thanks in advance for helping my understanding.

Low text quality when inferencing a gguf model with Neural Speed vs llama.cpp

Current Behavior:

  • Generated a gguf model from llama2-7b using the llama.cpp code; inferencing it directly with Neural Speed gives this text, and as you can see it's low quality:

(screenshot of the low-quality output generated by Neural Speed)

  • While inferencing the exact same gguf file with llama.cpp gives me this text:

(screenshot of the output generated by llama.cpp)

Steps To Reproduce:

# Windows
## llama.cpp
main.exe -m ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap --ignore-eos
## Neural Speed
python scripts\inference.py --model_name llama2 -m ggml-model-q4_0.gguf -n 512 -p "Building a website can be done in 10 simple steps:"

Environment:

  • OS: Win11
  • HW: SPR w9-3595X E5 128GB

Error: Unable to install.

OS details: lsb_release -a

Distributor ID: Ubuntu
Description:    Ubuntu 23.10
Release:        23.10
Codename:       mantic

Python Version: python -V

Python 3.10.14

GCC Version: which gcc | xargs file

/usr/bin/gcc: symbolic link to gcc-13

CMake version: cmake --version

cmake version 3.29.3

Error snippet: pip install .

CMake Error: CMake was unable to find a build program corresponding to "Unix Makefiles".  CMAKE_MAKE_PROGRAM is not set.  You probably need to select a different build tool.
      CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
      CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
      -- Configuring incomplete, errors occurred!

...
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> neural-speed

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Error during - pip install . -

Hi all,

I'm trying to install the neural-speed library. I'm following the "Build Python package (Recommended way)" tutorial.

System configuration:
OS : WSL2 - Linux DESKTOP-PNBMAG8 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python: Python 3.10.12

My steps:

  1. CREATE PROJECT FOLDER: neural-speed-tutorial
  2. CREATE VIRTUAL ENV: python3 -m venv neural-speed-env
  3. CLONE NEURAL-SPEED REPOSITORY: git clone https://github.com/intel/neural-speed.git
  4. RUN: pip install -r requirements.txt (successful)
  5. pip install .

TERMINAL LOG ERROR:
CMake Error in neural_speed/application/CMakeLists.txt:
  Imported target "pybind11::module" includes non-existent path

    "/usr/include/python3.10"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
    provide.

What could be the solution? Did I miss any crucial steps during the installation or while executing the commands listed above?

Thank you for any suggestions.

Is batch size > 1 supported?

Hi all,

Is inference with batch size > 1 supported? I found the following:

if (batch_size > 1)
  MODEL_ASSERT(
      ("llama arch only supports continuous batching inference when giving multi prompts.", lctx.cont_batching));

thanks
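
For reference, here is a minimal sketch of what batched input could look like through the Transformer-like API, under the assumption that it accepts padded, batched input_ids the same way Hugging Face's generate does; whether this path uses continuous batching or trips the assertion above depends on the model architecture and build:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# A sketch under assumptions: batch two prompts with left padding and pass the padded
# input_ids to generate. The model id below is just a placeholder.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Once upon a time,", "Building a website can be done in 10 simple steps:"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)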

I saw how beautiful this repo is in terms of parallelism, NUMA handling, etc.

I understand this is an Intel repo, but I'm curious whether AMD would work as well, and what kind of architecture / Intel chipset is best used with this repo.

I'm about to upgrade my hardware and would appreciate suggestions for use with this repo.

I have yet to run it, but I hope to learn which chipset works best with it.

Question about Thread pool and GEMV

I've been working on efficient GEMV multiplication on CPUs lately, and I've found I can only get a limited amount of improvement after adopting SIMD. Referring to your BesTLA library might inspire me; I'm really looking forward to BesTLA's GEMV kernels.
Also, I have a question about BesTLA's thread pool: is it a custom thread pool, or is it based on OpenMP?

By the way, I'm looking forward to seeing BesTLA become more widely used, or callable as a standalone library, just like OpenBLAS.

Documentation for whisper inference

I tried running inference (Transformer-like usage, because apparently the llama.cpp-like usage is not available for Whisper) and installed intel_extension_for_transformers, but now it fails on

import neural_speed.whisper_cpp as cpp_model
ModuleNotFoundError: No module named 'neural_speed.whisper_cpp'

I installed neural-speed the way it is described in the docs, i.e.,

pip install -r requirements.txt
pip install .

and was successful in running phi-1.5 inference the llama.cpp way.

Please provide guidance on how to run Whisper inference, and, as with other models, please also add 3-bit inference support for Whisper.

Modifying the model's hyperparameters

Hello there,

I'm new to Neural Speed, coming from llama-cpp-python, and I've encountered some problems (probably due to a misunderstanding on my side).

I don't want to flood you with issues, so I'll start with my two main questions:

  • Is there a way to change the model's hyperparameters (the temperature, mostly)? (See the sketch at the end of this message.)
  • Is there a way to not use a tokenizer coming from HF, and instead, like llama-cpp, use the tokenizer included in the .gguf file? (In my use case, I'd rather not depend on an external library.)

Thank you!
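
Regarding the first question, here is a minimal sketch of what I would try (an assumption on my side, based only on the other scripts in this thread that pass sampling parameters straight to generate):

# Sketch (assumption): sampling parameters such as temperature are passed as keyword
# arguments to generate, as in the other scripts in this thread; model, inputs and
# streamer are assumed to have been created as in those scripts.
outputs = model.generate(
    inputs,
    streamer=streamer,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7,
    top_k=40,
    top_p=0.95,
)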

ModuleNotFoundError: No module named 'neural_speed.mistral_cpp'

I was trying to run this demo code:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

And I keep getting this error in "neural-speed/neural_speed/__init__.py", line 68, in _import_package:
ModuleNotFoundError: No module named 'neural_speed.mistral_cpp'
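
As a first debugging step (a sketch, not an official procedure), it may help to list which compiled back-end modules the installed neural_speed package actually ships, to check whether mistral_cpp was built at all:

# List the compiled *_cpp extension modules that were installed with neural_speed,
# to check whether mistral_cpp is present.
import pkgutil
import neural_speed

print("neural_speed installed at:", neural_speed.__path__)
compiled = [m.name for m in pkgutil.iter_modules(neural_speed.__path__) if m.name.endswith("_cpp")]
print("available compiled modules:", compiled)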

SYCL support?

Are you planning to support SYCL as well for Core Ultra CPUs, leveraging the iGPU along with the CPU? This would make local runs faster for long-context RAG applications.
