Status: Open
Labels: bug (Something isn't working)
Description
- What versions are you using?
oracledb.__version__ = 3.4.1 | Thin = True
DB 19.29
platform.platform: Windows-11-10.0.22631-SP0
sys.maxsize > 2**32: True
platform.python_version: 3.12.2
- Is it an error or a hang or a crash?
Error; suspecting a memory leak.
- What error(s) or behavior are you seeing?
When using connection.direct_path_load in a loop (saving a large pandas DataFrame), the memory taken by the Python process grows after each iteration.
It seems to be because the pandas DataFrame is passed directly as the data parameter.
In older oracledb versions, when this was not possible and the data had to be provided as a list of tuples, such behavior is not observed.
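For reference, a minimal sketch of that older list-of-tuples style. The itertuples conversion here is my suggestion, not taken from the script below (which uses df.values in a commented-out line); itertuples avoids the dtype widening to object that df.values can trigger on mixed columns. Connection and table names are placeholders:

```python
import pandas as pd
from datetime import datetime

# Tiny stand-in DataFrame with the same column layout as the real one
df = pd.DataFrame({
    "IDD": pd.Series([1, 2], dtype="int64"),
    "DESCRIPTION": pd.Series(["foo", "bar"], dtype="string"),
    "SOME_DATE": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
})

# Plain Python tuples; column order must match column_names
rows = list(df.itertuples(index=False, name=None))

# Old style: instead of data=df, pass the materialized rows, e.g.
# conn.direct_path_load(schema_name=SCHEMA, table_name=TABLE,
#                       column_names=["IDD", "DESCRIPTION", "SOME_DATE"],
#                       data=rows)
print(len(rows), rows[0][1])
```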
Output from the sample script (provided in 5.), showing the rising memory footprint:
Generating DataFrame...
DF shape: (100000, 3) | dtypes:
IDD int64
DESCRIPTION string[python]
SOME_DATE datetime64[ns]
dtype: object
After DF generation - RSS: 145.29 MB
Connecting (Thin mode)...
oracledb.__version__ = 3.4.1 | Thin = True
DB V.:19.29.0.0.0
Starting 10 direct path loads into BI.Z_TEST_DIRPATH...
After load 001 - RSS: 234.39 MB
Load 001 completed in 4.97s
After load 002 - RSS: 317.71 MB
Load 002 completed in 4.82s
After load 003 - RSS: 402.05 MB
Load 003 completed in 4.74s
After load 004 - RSS: 451.36 MB
Load 004 completed in 4.81s
After load 005 - RSS: 500.50 MB
Load 005 completed in 4.77s
After load 006 - RSS: 550.57 MB
Load 006 completed in 4.43s
After load 007 - RSS: 603.52 MB
Load 007 completed in 4.72s
After load 008 - RSS: 652.11 MB
Load 008 completed in 4.76s
After load 009 - RSS: 663.86 MB
Load 009 completed in 4.73s
After load 010 - RSS: 746.16 MB
Load 010 completed in 4.80s
Finished 10 loads in 47.74s
Final - RSS: 746.16 MB
- Does your application call init_oracle_client()?
No
- Include a runnable Python script that shows the problem.
SQL:
CREATE TABLE Z_TEST_DIRPATH (
IDD NUMBER,
DESCRIPTION VARCHAR2(1000),
SOME_DATE DATE
);
Python script:
import os
import time
import gc
from datetime import datetime, timedelta

import pandas as pd
import oracledb

# Optional: report OS-level RSS memory
try:
    import psutil
    HAVE_PSUTIL = True
except ImportError:
    HAVE_PSUTIL = False

# ------------------------------
# Connection settings (EDIT or ENV)
# ------------------------------
DB_USER = os.getenv("DB_USER", "your_user")
DB_PASSWORD = os.getenv("DB_PASSWORD", "your_password")
DB_DSN = os.getenv("DB_DSN", "host/service")  # e.g. "dbhost.example.com/orclpdb"
SCHEMA = os.getenv("DB_SCHEMA", "schema")
TABLE = os.getenv("DB_TABLE", "Z_TEST_DIRPATH")

# Thin mode is required for direct_path_load; do NOT call init_oracle_client().
# direct_path_load implicitly commits each load. See:
# https://python-oracledb.readthedocs.io/en/latest/user_guide/batch_statement.html

# ------------------------------
# DataFrame generator
# ------------------------------
def make_test_df(n_rows: int = 100_000) -> pd.DataFrame:
    """
    Create a DataFrame with columns:
    - IDD: 1..n_rows (int)
    - DESCRIPTION: variable-length text 250..750 chars (pseudo-random pattern, deterministic)
    - SOME_DATE: date values (deterministic, pseudo-random spread)
    We avoid heavy random to keep generation fast & reproducible.
    """
    # IDD
    idd = pd.Series(range(1, n_rows + 1), dtype="int64", name="IDD")
    # DESCRIPTION: build a long base string and slice variable lengths 250..750
    base = (
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
        "aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in "
        "voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint "
        "occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit "
        "anim id est laborum. "
    )
    # ensure we have enough characters to slice
    base = (base * 20)[:1000]
    lengths = ((idd % 501) + 250).astype("int64")  # 250..750
    # for large sizes, a list comprehension is faster than vectorized slicing via map
    desc = [base[:int(n)] for n in lengths]
    desc = pd.Series(desc, dtype="string", name="DESCRIPTION")  # string dtype maps to VARCHAR2
    # SOME_DATE: pick a deterministic spread over ~5 years
    start = datetime(2020, 1, 1)
    # offsets in days & seconds using simple modular arithmetic (no true randomness)
    day_offsets = (idd % 1825).astype("int64")  # up to 5 years
    sec_offsets = ((idd * 97) % 86400).astype("int64")  # up to 1 day
    dates = [start + timedelta(days=int(d), seconds=int(s)) for d, s in zip(day_offsets, sec_offsets)]
    some_date = pd.to_datetime(pd.Series(dates, name="SOME_DATE"))
    df = pd.concat([idd, desc, some_date], axis=1)
    return df

# ------------------------------
# Memory helper
# ------------------------------
def report_memory(prefix: str = ""):
    if HAVE_PSUTIL:
        rss = psutil.Process().memory_info().rss / (1024 * 1024)
        print(f"{prefix}RSS: {rss:,.2f} MB")
    else:
        # Fallback: Python-level object sizes only, not C allocations
        print(f"{prefix}(psutil not installed)")

# ------------------------------
# Main test: 10 x direct_path_load
# ------------------------------
def main():
    # Generate the DF once and reuse it, to better expose leak-like growth if any
    print("Generating DataFrame...")
    df = make_test_df(100_000)
    print(f"DF shape: {df.shape} | dtypes:\n{df.dtypes}")
    report_memory(prefix="After DF generation - ")

    print("Connecting (Thin mode)...")
    # Thin mode by default - do not call init_oracle_client()
    with oracledb.connect(user=DB_USER, password=DB_PASSWORD, dsn=DB_DSN) as conn:
        # Optional: verify Thin mode & versions
        print(f"oracledb.__version__ = {oracledb.__version__} | Thin = {conn.thin}")
        print(f"DB V.:{conn.version}")
        # Direct path loads are supported only in Thin mode (see link above).
        column_names = ["IDD", "DESCRIPTION", "SOME_DATE"]
        N = 10
        print(f"Starting {N} direct path loads into {SCHEMA}.{TABLE}...")
        start_ts = time.time()
        for i in range(1, N + 1):
            t0 = time.time()
            # batch_size can optionally be tuned for direct_path_load;
            # if omitted, the driver chooses defaults.
            conn.direct_path_load(
                schema_name=SCHEMA,
                table_name=TABLE,
                column_names=column_names,
                data=df,  # new feature - pass the pandas df directly
                # data=[tuple(x) for x in df.values],  # old style - list of tuples
            )
            # implicit commit here (per call)
            t1 = time.time()
            gc.collect()  # ensure the Python heap is cleaned between iterations
            report_memory(prefix=f"After load {i:03d} - ")
            print(f"Load {i:03d} completed in {(t1 - t0):.2f}s")
        total = time.time() - start_ts
        print(f"Finished {N} loads in {total:.2f}s")
    report_memory(prefix="Final - ")

if __name__ == "__main__":
    main()
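To narrow down where the retained memory lives, the RSS numbers above can be compared against tracemalloc's Python-heap total between iterations: if tracemalloc stays flat while RSS keeps rising, the growth is in C-level allocations (e.g. native buffers inside the driver) rather than Python objects. A standalone sketch of the measurement pattern, with a stand-in allocation in place of the real direct_path_load call:

```python
import gc
import tracemalloc

tracemalloc.start()

def python_heap_mb() -> float:
    """Size of currently tracked Python allocations, in MB."""
    gc.collect()
    current, _peak = tracemalloc.get_traced_memory()
    return current / (1024 * 1024)

baseline = python_heap_mb()
# In the real script, a conn.direct_path_load(...) call goes here;
# this ~1 MB list of bytes objects stands in for one load iteration.
payload = [bytes(1024) for _ in range(1024)]
grown = python_heap_mb()
del payload
released = python_heap_mb()

print(f"baseline={baseline:.2f} MB, grown={grown:.2f} MB, released={released:.2f} MB")
```

Alongside the report_memory RSS prints, logging python_heap_mb after each load would show whether the two totals diverge.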