@pctablet505 commented Jan 30, 2026

  • Add index_map_.reserve(vocab.size()) in StringVocab constructor to prevent hash map rehashing
  • Add comprehensive regression tests for StringVocab hash map bug

This pull request fixes a bug in the StringVocab implementation that could cause the FastWordpieceTokenizer to fail when the vocabulary has 7 or more entries and the unknown token is not the last element. The fix reserves capacity in the internal hash map before entries are inserted, so the map never rehashes during construction. A set of regression tests has also been added to verify the fix across several vocabulary configurations.
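The mechanism can be illustrated in isolation. The sketch below uses std::unordered_map as a stand-in (the actual map type in tensorflow_text may differ) to show that reserving capacity up front keeps the bucket count fixed while entries are inserted, whereas an unreserved map rehashes as it grows:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns true if the map rehashed (i.e. its bucket count changed) while
// inserting `n` entries. `reserve_first` applies the fix up front.
bool RehashedDuringInsert(std::size_t n, bool reserve_first) {
  std::unordered_map<std::string, int> index_map;
  if (reserve_first) index_map.reserve(n);
  const auto buckets_before = index_map.bucket_count();
  for (std::size_t i = 0; i < n; ++i) {
    index_map.emplace("token_" + std::to_string(i), static_cast<int>(i));
  }
  return index_map.bucket_count() != buckets_before;
}
```

With std::unordered_map the exact growth points are implementation-defined, so the threshold of 7 observed in the original bug is specific to the hash map implementation used there.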

Bug fix in vocabulary hash map construction:

  • tensorflow_text/core/kernels/string_vocab.cc: Added a call to index_map_.reserve(vocab.size()) in the StringVocab constructor to prevent hash map rehashing, which previously caused lookup failures for the unknown token when vocabularies had 7 or more entries and the unknown token was not last.
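A minimal sketch of the fixed constructor, assuming a simplified StringVocab with std::unordered_map standing in for the actual map type; the class and member names mirror the description above but are illustrative, not the real tensorflow_text source:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for StringVocab
// (tensorflow_text/core/kernels/string_vocab.cc).
class StringVocabSketch {
 public:
  explicit StringVocabSketch(const std::vector<std::string>& vocab)
      : vocab_(vocab) {
    // The fix: size the table for all entries before inserting, so the
    // map never rehashes mid-construction.
    index_map_.reserve(vocab_.size());
    for (std::size_t i = 0; i < vocab_.size(); ++i) {
      index_map_.emplace(vocab_[i], static_cast<int>(i));
    }
  }

  // Returns the token's index, or -1 if it is not in the vocabulary.
  int LookupId(const std::string& token) const {
    auto it = index_map_.find(token);
    return it == index_map_.end() ? -1 : it->second;
  }

 private:
  std::vector<std::string> vocab_;
  std::unordered_map<std::string, int> index_map_;
};
```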

Regression tests for tokenizer behavior:

  • tensorflow_text/python/ops/fast_wordpiece_tokenizer_test.py: Added the StringVocabHashMapBugTest class with multiple tests to confirm the tokenizer works correctly for vocabularies of size 6, 7, and 8, with the unknown token in various positions. These tests ensure the bug is fixed and prevent regressions.
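The actual regression tests are in the Python test file above and exercise the full tokenizer; as a rough illustration only, the same matrix (vocabulary sizes 6, 7, and 8 with the unknown token in various positions) can be sketched in C++ against a stand-in index map:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Builds the index map the way the fixed constructor does (reserve, then
// insert) and returns the index found for `token`, or -1 if absent.
int LookupWithReserve(const std::vector<std::string>& vocab,
                      const std::string& token) {
  std::unordered_map<std::string, int> index_map;
  index_map.reserve(vocab.size());
  for (std::size_t i = 0; i < vocab.size(); ++i) {
    index_map.emplace(vocab[i], static_cast<int>(i));
  }
  auto it = index_map.find(token);
  return it == index_map.end() ? -1 : it->second;
}

// Mirrors the regression matrix: sizes 6, 7, and 8 with "[UNK]" placed
// first, in the middle, and last; the unknown token must always be found.
bool UnkFoundForAllConfigs() {
  for (int size : {6, 7, 8}) {
    for (int unk_pos : {0, size / 2, size - 1}) {
      std::vector<std::string> vocab;
      for (int i = 0; i < size; ++i) {
        vocab.push_back(i == unk_pos ? "[UNK]" : "tok" + std::to_string(i));
      }
      if (LookupWithReserve(vocab, "[UNK]") != unk_pos) return false;
    }
  }
  return true;
}
```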

Fixes #1462




Linked issue: FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when vocabulary size >= 7 despite unk_token being present
