Skip to content

Memory leak suspected when direct_path_load uses pandas dataframe as input #561

@AigarsAk

Description

@AigarsAk
  1. What versions are you using?
    oracledb.__version__ = 3.4.1 | Thin = True
    DB 19.29
platform.platform: Windows-11-10.0.22631-SP0
sys.maxsize > 2**32: True
platform.python_version: 3.12.2
  1. Is it an error or a hang or a crash?
    Error, suspecting memory leak

  2. What error(s) or behavior you are seeing?

  When using connection.direct_path_load in loop (saving large pandas data frame), after each loop the memory taken by python precess gets bigger and bigger.
  It seems to be because the pandas df is directly given as dta parameter. 
  In older oracledb versions when it was not possible and data was provided as list of tuples such behavior is not observed.
Output from sample script (provided in 5.), showing rising memory footprint:
Generating DataFrame...
DF shape: (100000, 3) | dtypes:
IDD                     int64
DESCRIPTION    string[python]
SOME_DATE      datetime64[ns]
dtype: object
After DF generation - RSS: 145.29 MB
Connecting (Thin mode)...
oracledb.__version__ = 3.4.1 | Thin = True
DB V.:19.29.0.0.0
Starting 10 direct path loads into BI.Z_TEST_DIRPATH...
After load 001 - RSS: 234.39 MB
Load 001 completed in 4.97s
After load 002 - RSS: 317.71 MB
Load 002 completed in 4.82s
After load 003 - RSS: 402.05 MB
Load 003 completed in 4.74s
After load 004 - RSS: 451.36 MB
Load 004 completed in 4.81s
After load 005 - RSS: 500.50 MB
Load 005 completed in 4.77s
After load 006 - RSS: 550.57 MB
Load 006 completed in 4.43s
After load 007 - RSS: 603.52 MB
Load 007 completed in 4.72s
After load 008 - RSS: 652.11 MB
Load 008 completed in 4.76s
After load 009 - RSS: 663.86 MB
Load 009 completed in 4.73s
After load 010 - RSS: 746.16 MB
Load 010 completed in 4.80s
Finished 10 loads in 47.74s
Final - RSS: 746.16 MB
  1. Does your application call init_oracle_client()?
    No
  1. Include a runnable Python script that shows the problem.
    SQL:
CREATE TABLE Z_TEST_DIRPATH (
  IDD         NUMBER,
  DESCRIPTION VARCHAR2(1000),
  SOME_DATE   DATE
);

Pyton script:

import os
import time
import gc
from datetime import datetime, timedelta

import pandas as pd
import oracledb

# Optional: report OS-level RSS memory
try:
    import psutil
    HAVE_PSUTIL = True
except ImportError:
    HAVE_PSUTIL = False


# ------------------------------
# Connection settings (EDIT or ENV)
# ------------------------------
DB_USER = os.getenv("DB_USER", "your_user")
DB_PASSWORD = os.getenv("DB_PASSWORD", "your_password")
DB_DSN = os.getenv("DB_DSN", "host/service")  # e.g. "dbhost.example.com/orclpdb"
SCHEMA = os.getenv("DB_SCHEMA", "schema")
TABLE = os.getenv("DB_TABLE", "Z_TEST_DIRPATH")

# Thin mode is required for direct_path_load; do NOT call init_oracle_client()
# (direct_path_load implicitly commits each load).  [1](https://python-oracledb.readthedocs.io/en/latest/user_guide/batch_statement.html)

# ------------------------------
# DataFrame generator
# ------------------------------
def make_test_df(n_rows: int = 100_000) -> pd.DataFrame:
    """
    Create a DataFrame with columns:
      - IDD: 1..n_rows (int)
      - DESCRIPTION: variable-length text 250..750 chars (pseudo-random pattern, deterministic)
      - SOME_DATE: date values (deterministic, pseudo-random spread)
    We avoid heavy random to keep generation fast & reproducible.
    """
    # IDD
    idd = pd.Series(range(1, n_rows + 1), dtype="int64", name="IDD")

    # DESCRIPTION: build a long base string and slice variable lengths 250..750
    base = (
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
        "aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in "
        "voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint "
        "occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit "
        "anim id est laborum. "
    )
    # ensure we have enough characters to slice
    base = (base * 20)[:1000]

    lengths = ((idd % 501) + 250).astype("int64")  # 250..750
    # vectorized slicing via map is acceptable; for large sizes, list comp is faster
    desc = [base[:int(L)] for L in lengths]
    desc = pd.Series(desc, dtype="string", name="DESCRIPTION")  # use string dtype to map to VARCHAR2

    # SOME_DATE: pick a deterministic spread over ~5 years
    start = datetime(2020, 1, 1)
    # offsets in days & seconds using simple modular arithmetic (no true randomness)
    day_offsets = (idd % 1825).astype("int64")  # up to 5 years
    sec_offsets = ((idd * 97) % 86400).astype("int64")  # up to 1 day
    dates = [start + timedelta(days=int(d), seconds=int(s)) for d, s in zip(day_offsets, sec_offsets)]
    some_date = pd.to_datetime(pd.Series(dates, name="SOME_DATE"))

    df = pd.concat([idd, desc, some_date], axis=1)
    return df

# ------------------------------
# Memory helper
# ------------------------------
def report_memory(prefix: str = ""):
    if HAVE_PSUTIL:
        rss = psutil.Process().memory_info().rss / (1024 * 1024)
        print(f"{prefix}RSS: {rss:,.2f} MB")
    else:
        # Fallback: show DataFrame object sizes (Python-level, not C-allocations)
        print(f"{prefix}(psutil not installed)")

# ------------------------------
# Main test: 10 x direct_path_load
# ------------------------------
def main():
    # Generate DF once and reuse to better expose leak-like growth if any
    print("Generating DataFrame...")
    df = make_test_df(100_000)
    print(f"DF shape: {df.shape} | dtypes:\n{df.dtypes}")
    report_memory(prefix="After DF generation - ")

    print("Connecting (Thin mode)...")
    # Thin mode by default — do not call init_oracle_client()
    with oracledb.connect(user=DB_USER, password=DB_PASSWORD, dsn=DB_DSN) as conn:

        # Optional: verify Thin mode & versions
        print(f"oracledb.__version__ = {oracledb.__version__} | Thin = {conn.thin}")
        print(f"DB V.:{conn.version}")
        # Direct Path Loads are supported only in Thin mode. [1](https://python-oracledb.readthedocs.io/en/latest/user_guide/batch_statement.html)

        column_names = ["IDD", "DESCRIPTION", "SOME_DATE"]


        N = 10
        print(f"Starting {N} direct path loads into {SCHEMA}.{TABLE}...")
        start_ts = time.time()

        for i in range(1, N + 1):
            t0 = time.time()
            # You can optionally tune batch_size; it’s supported for direct_path_load.
            # If omitted, driver chooses defaults.
                      
            conn.direct_path_load(
                schema_name=SCHEMA,
                table_name=TABLE,
                column_names=column_names,
                data=df,  #new feature - direct from pandas df
                # data = [tuple(x) for x in df.values]   # old style - list of tuples
            )
            
            
      
            # implicit commit here (per call)
            t1 = time.time()
            gc.collect()  # ensure Python heap is cleaned between iterations
            report_memory(prefix=f"After load {i:03d} - ")
            print(f"Load {i:03d} completed in {(t1 - t0):.2f}s")

        total = time.time() - start_ts
        print(f"Finished {N} loads in {total:.2f}s")
        report_memory(prefix="Final - ")

if __name__ == "__main__":
    main()

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions