Add a fast path that doesn't include normalized chunks in tokenize #11017
The idea in this PR is to add a fast path for `open_dataset` that just uses the token that is passed into `_maybe_chunk` and doesn't worry about including the normalized chunks within the token.

Before:

After:

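The fast-path idea can be sketched roughly as follows. This is a simplified illustration, not the actual xarray implementation: `tokenize` here is a `hashlib` stand-in for `dask.base.tokenize`, and `normalize_chunks` and `make_chunk_token` are hypothetical helpers that mimic the expensive chunk normalization the fast path avoids.

```python
import hashlib


def tokenize(*args):
    # Stand-in for dask.base.tokenize: a deterministic hash of the inputs.
    return hashlib.md5(repr(args).encode()).hexdigest()


def normalize_chunks(chunks, shape):
    # Placeholder for the potentially expensive step that expands a chunk
    # spec into explicit per-dimension tuples.
    return tuple((dim_size,) if chunks == -1 else (chunks,) for dim_size in shape)


def make_chunk_token(name, shape, chunks, token=None):
    if token is not None:
        # Fast path: trust the caller-supplied token and skip folding the
        # normalized chunks into the hash.
        return tokenize(name, token)
    # Slow path: normalize the chunks and include them in the token.
    return tokenize(name, normalize_chunks(chunks, shape))


fast = make_chunk_token("var", (10, 10), 5, token="abc")
slow = make_chunk_token("var", (10, 10), 5)
```

The point is that when a token is already supplied, the chunk spec never needs to be normalized or hashed, which is what saves time on datasets with very many variables or dimensions.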
This PR shaves ~30 sec off the previous runtime for the dataset from the original issue. I was still seeing pretty intense memory consumption (17.14 GB) for this `open_dataset` call, though; that's not a new thing, just wanted to flag it.