@pctablet505 commented Jan 30, 2026

  • Add index_map_.reserve(vocab.size()) in StringVocab constructor to prevent hash map rehashing
  • Add comprehensive regression tests for StringVocab hash map bug

This pull request fixes a bug in the StringVocab implementation that could cause the FastWordpieceTokenizer to fail when the vocabulary has 7 or more entries and the unknown token is not the last element. The fix reserves capacity in the internal hash map before entries are inserted, so the map never rehashes during construction. A set of regression tests has also been added to verify the fix across several vocabulary configurations.
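The mechanism can be illustrated in isolation. The sketch below uses std::unordered_map as a stand-in (the actual map type in tensorflow_text may differ) to show that reserving capacity up front keeps the bucket count fixed while entries are inserted, whereas an unreserved map rehashes as it grows:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns true if the map rehashed (i.e. its bucket count changed) while
// inserting `n` entries. `reserve_first` applies the fix up front.
bool RehashedDuringInsert(std::size_t n, bool reserve_first) {
  std::unordered_map<std::string, int> index_map;
  if (reserve_first) index_map.reserve(n);
  const auto buckets_before = index_map.bucket_count();
  for (std::size_t i = 0; i < n; ++i) {
    index_map.emplace("token_" + std::to_string(i), static_cast<int>(i));
  }
  return index_map.bucket_count() != buckets_before;
}
```

With std::unordered_map the exact growth points are implementation-defined, so the threshold of 7 observed in the original bug is specific to the hash map implementation used there.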

Bug fix in vocabulary hash map construction:

  • tensorflow_text/core/kernels/string_vocab.cc: Added a call to index_map_.reserve(vocab.size()) in the StringVocab constructor to prevent hash map rehashing, which previously caused lookup failures for the unknown token when vocabularies had 7 or more entries and the unknown token was not last.
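A minimal sketch of the fixed constructor, assuming a simplified StringVocab with std::unordered_map standing in for the actual map type; the class and member names mirror the description above but are illustrative, not the real tensorflow_text source:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for StringVocab
// (tensorflow_text/core/kernels/string_vocab.cc).
class StringVocabSketch {
 public:
  explicit StringVocabSketch(const std::vector<std::string>& vocab)
      : vocab_(vocab) {
    // The fix: size the table for all entries before inserting, so the
    // map never rehashes mid-construction.
    index_map_.reserve(vocab_.size());
    for (std::size_t i = 0; i < vocab_.size(); ++i) {
      index_map_.emplace(vocab_[i], static_cast<int>(i));
    }
  }

  // Returns the token's index, or -1 if it is not in the vocabulary.
  int LookupId(const std::string& token) const {
    auto it = index_map_.find(token);
    return it == index_map_.end() ? -1 : it->second;
  }

 private:
  std::vector<std::string> vocab_;
  std::unordered_map<std::string, int> index_map_;
};
```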

Regression tests for tokenizer behavior:

  • tensorflow_text/python/ops/fast_wordpiece_tokenizer_test.py: Added the StringVocabHashMapBugTest class with multiple tests to confirm the tokenizer works correctly for vocabularies of size 6, 7, and 8, with the unknown token in various positions. These tests ensure the bug is fixed and prevent regressions.
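The actual regression tests are in the Python test file above and exercise the full tokenizer; as a rough illustration only, the same matrix (vocabulary sizes 6, 7, and 8 with the unknown token in various positions) can be sketched in C++ against a stand-in index map:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Builds the index map the way the fixed constructor does (reserve, then
// insert) and returns the index found for `token`, or -1 if absent.
int LookupWithReserve(const std::vector<std::string>& vocab,
                      const std::string& token) {
  std::unordered_map<std::string, int> index_map;
  index_map.reserve(vocab.size());
  for (std::size_t i = 0; i < vocab.size(); ++i) {
    index_map.emplace(vocab[i], static_cast<int>(i));
  }
  auto it = index_map.find(token);
  return it == index_map.end() ? -1 : it->second;
}

// Mirrors the regression matrix: sizes 6, 7, and 8 with "[UNK]" placed
// first, in the middle, and last; the unknown token must always be found.
bool UnkFoundForAllConfigs() {
  for (int size : {6, 7, 8}) {
    for (int unk_pos : {0, size / 2, size - 1}) {
      std::vector<std::string> vocab;
      for (int i = 0; i < size; ++i) {
        vocab.push_back(i == unk_pos ? "[UNK]" : "tok" + std::to_string(i));
      }
      if (LookupWithReserve(vocab, "[UNK]") != unk_pos) return false;
    }
  }
  return true;
}
```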

Fixes #1462




Linked issue: FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when vocabulary size >= 7 despite unk_token being present
