# BeeTokenizer

Note: this is literally a tokenizer trained on beekeeping text.

After minutes of hard work, it is now available:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

output = tokenizer(test_string)
print(f"Test string: {test_string}")
print(f"Token IDs:\n\t{output.input_ids}")
```

## Notes

<details> <summary>How to Tokenize Text and Retrieve Offsets</summary>

To tokenize a complex sentence and also retrieve the offset mapping, you can use the following Python snippet:

```python
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/BeeTokenizer")

# Sample complex sentence related to beekeeping
test_string = "When dealing with Varroa destructor mites, it's crucial to administer the right acaricides during the late autumn months, but only after ensuring that the worker bee population is free from pesticide contamination."

# Tokenize the input string and get the offset mapping
# (return_offsets_mapping requires a fast tokenizer)
output = tokenizer(test_string, return_offsets_mapping=True)

print(f"Test string: {test_string}")

# Tokens
tokens = tokenizer.convert_ids_to_tokens(output["input_ids"])
print(f"Tokens: {tokens}")

# Offsets: (start, end) character positions of each token in the input
offsets = output["offset_mapping"]
print(f"Offsets: {offsets}")
```

</details>
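Each pair in the offset mapping is a `(start, end)` character span into the original string, so `test_string[start:end]` recovers the exact text a token covers. A minimal, tokenizer-free sketch of that idea (the offsets below are derived from whitespace splitting purely for illustration, not produced by BeeTokenizer):

```python
# Illustrative only: build (start, end) spans via whitespace splitting,
# standing in for the real spans in output["offset_mapping"].
test_string = "Varroa destructor mites"

offsets = []
pos = 0
for word in test_string.split():
    start = test_string.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

# Slice the original string with each span to recover the covered text
pieces = [test_string[start:end] for start, end in offsets]
print(pieces)  # ['Varroa', 'destructor', 'mites']
```

The same slicing works on real tokenizer offsets, which is handy for aligning token-level model outputs back to character positions in the source text.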