Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Issue Overview
The `model` property in `EmbeddingService` lazy-loads the SentenceTransformer model without any synchronization.
Multiple concurrent requests can therefore trigger the load at the same time, wasting memory and risking race conditions or crashes in a multi-threaded or async environment.
Steps to Reproduce
- Start the backend service.
- Trigger multiple concurrent calls to any method that accesses EmbeddingService.model (e.g., get_embedding or get_embeddings).
- Observe that the model is loaded multiple times concurrently (check logs or GPU memory usage).
- Optionally, run stress tests with multiple async profile summaries to see potential blocking or crashes.
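The race described above can be reproduced without the real model. The sketch below uses a hypothetical `LazyService` stand-in (not from the actual codebase) whose slow "load" is simulated with a sleep; a barrier releases all threads at once so they all pass the `is None` check before any load finishes:

```python
import threading
import time


class LazyService:
    """Hypothetical stand-in for EmbeddingService's unsynchronized lazy load."""
    load_count = 0  # how many times the "model" was loaded

    def __init__(self):
        self._model = None

    @property
    def model(self):
        if self._model is None:        # no lock: check and load can interleave
            time.sleep(0.05)           # simulate the slow model load
            LazyService.load_count += 1
            self._model = object()
        return self._model


svc = LazyService()
barrier = threading.Barrier(8)


def worker():
    barrier.wait()  # release all threads at the same moment
    _ = svc.model


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(LazyService.load_count)  # typically > 1: the model was loaded repeatedly
```

With a real SentenceTransformer, each duplicate load would allocate a full copy of the model weights, which is where the memory/GPU exhaustion comes from.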
Expected Behavior
The model should be loaded only once, regardless of how many concurrent requests access it.
No race conditions or duplicate memory usage should occur.
Actual Behavior
Multiple threads or async tasks can instantiate the model multiple times concurrently.
This may lead to high memory usage, GPU resource exhaustion, or task failures.
Suggested Improvements
Use a thread-safe lock with a double-checked pattern when lazy-loading the model:

```python
import threading

from sentence_transformers import SentenceTransformer


class EmbeddingService:
    _model_lock = threading.Lock()

    @property
    def model(self) -> SentenceTransformer:
        # Double-checked locking: cheap unlocked check first,
        # then re-check under the lock before loading.
        if self._model is None:
            with self._model_lock:
                if self._model is None:
                    self._model = SentenceTransformer(self.model_name, device=self.device)
        return self._model
```
This ensures only one instance of the model is created, preventing race conditions and resource duplication.
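To illustrate that the lock closes the race, here is the same hypothetical stress harness as above, with the double-checked lock applied (`SafeLazyService` is an illustrative stand-in, not the real class):

```python
import threading
import time


class SafeLazyService:
    """Hypothetical stand-in with the double-checked lock applied."""
    _lock = threading.Lock()
    load_count = 0

    def __init__(self):
        self._model = None

    @property
    def model(self):
        if self._model is None:
            with self._lock:
                if self._model is None:   # re-check under the lock
                    time.sleep(0.05)      # simulate the slow load
                    SafeLazyService.load_count += 1
                    self._model = object()
        return self._model


svc = SafeLazyService()
barrier = threading.Barrier(8)


def worker():
    barrier.wait()
    _ = svc.model


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(SafeLazyService.load_count)  # → 1: every later thread sees the loaded model
```

The first thread to acquire the lock performs the load; every other thread blocks on the lock and then finds `_model` already set by the re-check, so exactly one instance is ever created.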
Record
- I agree to follow this project's Code of Conduct
- I want to work on this issue