HuggingFaceGECLM/mix_tok_v2 - AI Model Zoo - BimAnt

V2 of an English/code tokenizer. Byte-level BPE, 64k vocab, split digits (the difference from v1). The training data is an equal mix of the following sources.

On the NL side:

Books
C4
v1 of our CC (helen quality classifier)
enwiki
Gutenberg
Reddit

On the code side:

Jupyter notebooks (0.5 weight, since the dataset was small)
GH issues
Stackexchange
The cleaned Python Stack

For a total of roughly 1/3 code data (although there is a lot of English in Stackexchange and GH issues).
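The code share can be checked with a little arithmetic. The sketch below assumes that "equal mix" means every source gets a sampling weight of 1, except Jupyter notebooks at 0.5 as stated in the list. The dictionary names and the `code_share` helper are made up for illustration and are not part of the tokenizer's actual training setup.

```python
from fractions import Fraction

# Assumed sampling weights: 1 per source, 0.5 for Jupyter notebooks.
NL_WEIGHTS = {
    "Books": Fraction(1),
    "C4": Fraction(1),
    "CC v1": Fraction(1),
    "enwiki": Fraction(1),
    "Gutenberg": Fraction(1),
    "Reddit": Fraction(1),
}
CODE_WEIGHTS = {
    "Jupyter notebooks": Fraction(1, 2),
    "GH issues": Fraction(1),
    "Stackexchange": Fraction(1),
    "Python Stack": Fraction(1),
}


def code_share(nl_weights, code_weights):
    """Return the fraction of the total mix weight that comes from code sources."""
    code_total = sum(code_weights.values())
    total = code_total + sum(nl_weights.values())
    return code_total / total


share = code_share(NL_WEIGHTS, CODE_WEIGHTS)
print(share, round(float(share), 3))
```

Under these assumed weights the code side holds 3.5 of 9.5 total weight, which is a little over one third.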

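The "split digits" option mentioned in the description can be illustrated with a small standalone sketch. It only shows the idea of isolating each digit before any BPE merges are learned, so that a number is never merged into a single multi-digit token. The `split_digits` helper and its regular expression are hypothetical and are not the pre-tokenizer actually used for this tokenizer.

```python
import re
from typing import List

# Match either a single digit or a run of non-digit characters.
_DIGIT_OR_RUN = re.compile(r"\d|\D+")


def split_digits(text: str) -> List[str]:
    """Split text so that every digit becomes its own piece.

    Non-digit runs are kept whole; a byte-level BPE would then learn
    merges inside each piece but never across digit boundaries.
    """
    return _DIGIT_OR_RUN.findall(text)


pieces = split_digits("v2 has 64000 tokens")
print(pieces)
```

Keeping digits separate gives every number a uniform, digit-by-digit encoding, at the cost of longer token sequences for numeric text.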