Fix FastWordpieceTokenizer bug for vocabularies >= 7 tokens #1463
This pull request fixes a bug in the `StringVocab` implementation that could cause the `FastWordpieceTokenizer` to fail when the vocabulary size is 7 or more and the unknown token is not the last element. The fix reserves capacity in the internal hash map up front to prevent rehashing issues during construction. A new set of regression tests verifies that the bug is resolved for various vocabulary configurations.

Bug fix in vocabulary hash map construction:
- `tensorflow_text/core/kernels/string_vocab.cc`: Added a call to `index_map_.reserve(vocab.size())` in the `StringVocab` constructor to prevent hash map rehashing, which previously caused lookup failures for the unknown token when vocabularies had 7 or more entries and the unknown token was not last.

Regression tests for tokenizer behavior:
- `tensorflow_text/python/ops/fast_wordpiece_tokenizer_test.py`: Added the `StringVocabHashMapBugTest` class with multiple tests to confirm the tokenizer works correctly for vocabularies of size 6, 7, and 8, with the unknown token in various positions. These tests confirm the fix and guard against regressions.

Fixes #1462
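For reviewers who want to see the reserve-before-insert pattern in isolation, here is a minimal sketch. It is not the code in `string_vocab.cc`: `ToyVocab`, `MakeVocab`, `CheckAllConfigurations`, and the `tokN` token names are hypothetical, and `std::unordered_map` is an assumed stand-in for whatever container backs `index_map_`. The sketch reserves capacity before inserting, then checks that the unknown token can be looked up for vocabularies of size 6, 7, and 8 at every position, mirroring the configurations the regression tests are described as covering.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a vocabulary index; not the TensorFlow Text class.
class ToyVocab {
 public:
  explicit ToyVocab(const std::vector<std::string>& vocab) {
    // Size the table up front so the insertions below never trigger a rehash.
    index_map_.reserve(vocab.size());
    buckets_after_reserve_ = index_map_.bucket_count();
    for (std::size_t i = 0; i < vocab.size(); ++i) {
      index_map_.emplace(vocab[i], static_cast<int>(i));
    }
  }

  // Returns the token's index, or -1 if the token is not in the vocabulary.
  int Lookup(const std::string& token) const {
    auto it = index_map_.find(token);
    return it == index_map_.end() ? -1 : it->second;
  }

  // True if the bucket count changed between reserve() and the last insert.
  bool RehashedDuringBuild() const {
    return index_map_.bucket_count() != buckets_after_reserve_;
  }

 private:
  std::unordered_map<std::string, int> index_map_;
  std::size_t buckets_after_reserve_ = 0;
};

// Builds a vocabulary of `size` tokens with "[UNK]" placed at `unk_pos`.
std::vector<std::string> MakeVocab(std::size_t size, std::size_t unk_pos) {
  std::vector<std::string> vocab;
  for (std::size_t i = 0; i < size; ++i) {
    vocab.push_back(i == unk_pos ? std::string("[UNK]")
                                 : "tok" + std::to_string(i));
  }
  return vocab;
}

// Checks sizes 6, 7, and 8 with the unknown token at every position.
bool CheckAllConfigurations() {
  for (std::size_t size = 6; size <= 8; ++size) {
    for (std::size_t unk_pos = 0; unk_pos < size; ++unk_pos) {
      ToyVocab vocab(MakeVocab(size, unk_pos));
      if (vocab.RehashedDuringBuild()) return false;
      if (vocab.Lookup("[UNK]") != static_cast<int>(unk_pos)) return false;
    }
  }
  return true;
}
```

Calling `CheckAllConfigurations()` from a test or `main` should return `true`. The sketch only shows that reserving capacity keeps the bucket count stable while the map is filled; it does not reproduce the original failure, which depends on the real container and key types used by `StringVocab`.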