Chris Ha's Projects
The original tooling for the OSCAR corpus rewritten in Rust
The website of the OSCAR project
Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM
Pretraining Efficiently on S2ORC!
Perform transformations on PII instances detected in documents
PyTorch image models, scripts, pretrained weights -- (SE)ResNet/ResNeXT, DPN, EfficientNet, MixNet, MobileNet-V3/V2/V1, MNASNet, Single-Path NAS, FBNet, and more
A PyTorch Lightning-based repo of the venerable PyTorch Lightning image models repo
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
(ImageNet pretrained models) The official pytorch implementation of the TPAMI paper "Res2Net: A New Multi-scale Backbone Architecture"
RFCs for changes to Rust
A fast Bloom filter implementation in Rust
Unsupervised text tokenizer for Neural Network-based text generation.
A fast Rust JSON library based on SIMD.
Solves the subset sum problem and returns a set of decomposed integers.
Fast suffix arrays for Rust (with Unicode support).
Pure Rust multimedia format demuxing, tag reading, and audio decoding library
All-in-one text de-duplication
xxh enhanced version of Rust port of TLSH
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
tlsh with pyo3 and soon xxhash
🕷️ The pipeline for the OSCAR corpus
char <-> Unicode character name (maintained fork of huonw/unicode_names)
very efficient rank and select
Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"
Centralised repository for WARC usage specifications.
wyhash fast portable non-cryptographic hashing algorithm and random number generator in Rust
Rust raw bindings to xxHash