Site icon Bugra Parlayan | Oracle Database Blog

Oracle Machine Learning for Python (OML4Py) 2.1: Comprehensive Technical Guide

1. What is OML4Py? Definition and Purpose

Oracle Machine Learning for Python (OML4Py) is an Oracle technology that integrates the rich data science and machine learning (ML) ecosystem of the Python programming language directly into the secure, scalable, and high-performance environment of the Oracle Database. Its primary goal is to enable Python users (data scientists, developers, analysts) to develop, train, and deploy machine learning models on data residing within the database, without needing to move the data out of the database.  

The most critical advantage of this approach is the elimination of data movement. Traditional ML workflows often involve extracting data from the database, moving it to a separate analysis environment (like a data scientist’s laptop or a dedicated ML server), processing it, building models, and then potentially loading results back. This process is time-consuming, costly, and introduces security risks. Moving large and sensitive datasets (financial records, patient information, etc.) outside the database can lead to compliance issues (e.g., GDPR) and increases the risk of data breaches. OML4Py addresses these challenges by keeping the entire ML process within the database. Data is processed and modeled at the source, maximizing data security, simplifying governance, and accelerating the analysis lifecycle.  

OML4Py targets a broad audience:

OML4Py is a key component of Oracle’s strategy to transform the database from a simple data repository into a powerful analytical and machine learning platform.  

2. What’s New in OML4Py 2.1?

OML4Py version 2.1, especially when used with Oracle Database 23ai, introduces significant new capabilities and enhancements, empowering users to implement more modern and diverse ML workflows within the database.  

These innovations make OML4Py a more powerful, flexible, and modern machine learning tool within the Oracle ecosystem.

3. Core Capabilities and Example Code (Scripts)

OML4Py provides an end-to-end machine learning workflow within Oracle Database for Python users. Key capabilities and practical code examples are detailed below.

3.1. Database Connection and Data Access

The first step is establishing a secure connection from the Python session to the Oracle Database, granting access to database resources (data, compute, storage).  

3.2. In-Database Model Training

OML4Py provides Python access to optimized ML algorithms built into the Oracle Database (classification, regression, clustering, feature extraction, etc.). Algorithms include GLM, SVM, Decision Trees, Naive Bayes, K-Means, O-Cluster, NMF, ESA, XGBoost, Neural Networks, EM, SVD, PCA, and Association Rules.  

Models are trained using a scikit-learn-like interface: instantiate oml.<algo_name>() and use the .fit(X, y) method, where X (features) and y (target) are OML DataFrame proxy objects. This executes the training entirely within the database, leveraging its parallelism and eliminating the need to pull large datasets into Python memory.  

Python

# Script Example 3: In-Database Model Training (Decision Tree)
import oml
from sklearn.model_selection import train_test_split # Only to split IDs

# Assuming connection established and IRIS_TABLE_OML4PY exists
if not oml.isconnected():
    print("Please connect to the database first.")
else:
    try:
        table_name = 'IRIS_TABLE_OML4PY'
        oml_iris_proxy = oml.sync(table=table_name)

        # Add ID and split for train/test (in-DB random sampling is often more efficient)
        oml_iris_proxy = oml_iris_proxy.assign(ID='rownum') # Add unique ID (DB function)
        ids = oml_iris_proxy['ID'].pull() # Pull only IDs to Python

        # Split IDs for train/test
        train_ids_pd, test_ids_pd = train_test_split(ids, test_size=0.3, random_state=42, stratify=oml_iris_proxy['species'].pull())

        # Create train/test OML DataFrames by filtering on IDs (in-database)
        train_data = oml_iris_proxy[oml_iris_proxy['ID'].isin(train_ids_pd['ID'].tolist())]
        test_data = oml_iris_proxy[oml_iris_proxy['ID'].isin(test_ids_pd['ID'].tolist())]

        print(f"Training set size (DB): {train_data.shape}")
        print(f"Test set size (DB): {test_data.shape}")

        # Separate features (X) and target (y) (proxy objects)
        train_X = train_data.drop(['species', 'ID']) # Drop ID from training features
        train_y = train_data['species']

        # Initialize and train Decision Tree model
        dt_settings = {'TREE_TERM_MAX_DEPTH': 5, 'TREE_IMPURITY_METRIC': 'TREE_IMPURITY_GINI'}
        dt_model = oml.dt(setting=dt_settings)

        # Fit the model (executes in-database)
        # case_id should be a unique row identifier
        dt_model.fit(train_X, train_y, case_id='ID')

        print("\nDecision Tree model trained successfully in-database.")
        print(f"Model Name: {dt_model.model_name}")
        print(f"Settings Used: {dt_model.settings}")

        # Display model details (if available)
        try:
            print("\nModel Details (Summary):")
            # print(dt_model.details) # Availability varies by model
        except AttributeError:
            print("'.details' attribute not available for this model.")

    except Exception as train_err:
        print(f"\nError during model training: {train_err}")

3.3. Model Evaluation and Scoring

In-database OML models are used for prediction (scoring) on new data via the .predict(X) and (for classification) .predict_proba(X) methods, also executed in-database. X is an OML DataFrame proxy. Use supplemental_cols to include original columns in the output.  

To evaluate model performance, predictions and actual values are typically pulled into Python (.pull()) and standard libraries like sklearn.metrics are used to calculate accuracy, precision, recall, F1-score, ROC AUC, etc. OML4Py’s oml.metrics module can also be used.

Python

# Script Example 4: Model Scoring and Evaluation
import oml
from sklearn.metrics import accuracy_score, classification_report

# Assuming connection, test_data, and dt_model exist
if not oml.isconnected():
    print("Please connect to the database first.")
elif 'test_data' not in locals() or 'dt_model' not in locals():
     print("Please run the training script (Script 3) first.")
else:
    try:
        test_X = test_data.drop(['species', 'ID']) # Features only
        test_y_actual_proxy = test_data[['ID', 'species']] # Actual values with ID (proxy)

        # Make predictions on test data (in-database)
        # Include ID column to easily join with actual values
        predictions_oml = dt_model.predict(test_X, supplemental_cols=test_data['ID'])
        print("Predictions generated in-database.")

        # Pull predictions and actual values to Python for comparison
        # Pull only necessary columns for performance
        results_oml = predictions_oml.merge(test_y_actual_proxy, on='ID')
        results_df = results_oml.pull() # Pull data into Pandas DataFrame

        print("\nPrediction Results (First 5):")
        print(results_df.head())

        # Calculate accuracy and other metrics using scikit-learn
        accuracy = accuracy_score(results_df['species'], results_df)
        print(f"\nModel Accuracy: {accuracy:.4f}")

        print("\nClassification Report:")
        print(classification_report(results_df['species'], results_df))

        # Predict probabilities (in-database) - for classification models
        probabilities_oml = dt_model.predict_proba(test_X, supplemental_cols=test_data['ID'])
        print("\nProbabilities calculated in-database (First 5):")
        print(probabilities_oml.head(5).pull()) # Pull and show first 5 probabilities

    except Exception as score_err:
        print(f"\nError during scoring/evaluation: {score_err}")

3.4. Embedded Python Execution (EPE)

Embedded Python Execution (EPE) is a powerful OML4Py feature allowing users to run custom Python functions directly within database-managed Python engines. This provides immense flexibility when in-database algorithms are insufficient or when using third-party Python libraries (scikit-learn, statsmodels, PyTorch, etc.) is necessary.  

Key EPE functions:

Additionally, use oml.script.create() to save Python functions to the database script repository for repeated use via oml.script.run() (or within methods like .assign()). Use oml.script.drop() to remove saved scripts.

Python

# Script Example 5: Embedded Python Execution (UDF and Script Management)
import oml
import pandas as pd

# Assuming connection and oml_iris_proxy exist
if not oml.isconnected():
    print("Please connect to the database first.")
else:
    oml_iris_proxy = oml.sync(table='IRIS_TABLE_OML4PY')

    # --- Part 1: Ad-hoc function execution with oml.table_apply ---
    print("--- Embedded Python Execution: oml.table_apply ---")
    # Simple Python function to run in-database
    def calculate_petal_area(df, length_col='petal_length', width_col='petal_width'):
        # This runs in the database Python engine
        # Takes a Pandas DataFrame as input
        import pandas as pd # Imports might be needed inside EPE function
        df_copy = df.copy() # Good practice to avoid modifying original df
        df_copy['petal_area'] = df_copy[length_col] * df_copy[width_col]
        return df_copy[['ID', 'petal_area']] # Return only ID and new column

    # Run the function in-database using oml.table_apply
    try:
        result_oml_df = oml.table_apply(
            oml_iris_proxy[['ID', 'petal_length', 'petal_width']], # Pass only needed columns
            func=calculate_petal_area,
            oml_input_type="pandas.DataFrame", # Specify input type
            parallel=True # Attempt parallel execution (depends on DB config)
        )

        print("Embedded Python function (calculate_petal_area) executed successfully.")
        print("Calculated Petal Area (first 5 rows):")
        print(result_oml_df.head(5).pull())

    except Exception as e:
        print(f"EPE (table_apply) error: {e}")
        print("Note: Ensure EPE is configured correctly with necessary grants (e.g., PYQADMIN).")

    # --- Part 2: Script repository management ---
    print("\n--- Embedded Python Execution: Script Repository ---")
    script_name = 'sepal_area_calculator'
    try:
        # Define function as a string
        func_str = """
import pandas as pd # Imports might be needed inside script

def calc_sepal_area(sepal_length, sepal_width):
    # Simple calculation
    # Inputs can be scalar or pandas Series
    return sepal_length * sepal_width
"""
        # Save function to the database script repository (overwrite if exists)
        oml.script.create(script_name, func_str, global_script=False, overwrite=True)
        print(f"Script '{script_name}' saved/updated successfully.")

        # Calculate a new column using the saved script (via oml.assign)
        oml_iris_with_sepal_area = oml_iris_proxy.assign(
            sepal_area=oml.script.run(
                args={'sepal_length': oml_iris_proxy['sepal_length'],
                      'sepal_width': oml_iris_proxy['sepal_width']},
                func=script_name,
                output_type='float' # Specify return data type
            )
        )
        print("\n'sepal_area' calculated using saved script:")
        print(oml_iris_with_sepal_area[['ID', 'sepal_length', 'sepal_width', 'sepal_area']].head().pull())

        # List saved scripts
        print("\nSaved Scripts:")
        print(oml.script.dir())

        # Drop the script
        oml.script.drop(script_name)
        print(f"\nScript '{script_name}' dropped.")

    except Exception as e:
        print(f"\nScript management error: {e}")
        print("Note: Ensure necessary grants (e.g., PYQADMIN role) for script management.")

3.5. Model and Object Management (Datastore)

OML4Py provides a datastore mechanism to persistently save and load Python objects (like scikit-learn models, DataFrames, lists, dictionaries) within the Oracle Database. This is useful for preserving state between Python sessions, storing trained non-OML models, or sharing objects.  

Note: Models trained using OML4Py’s in-database algorithms (oml.glm, oml.svm, etc.) are already persistent database objects and do not need to be saved to the datastore. The datastore is primarily for non-OML Python objects.

Python

# Script Example 6: Managing Python Objects with Datastore
import oml
import pandas as pd
from sklearn.linear_model import LinearRegression # For example Python model

# Assuming connection is established
if not oml.isconnected():
    print("Please connect to the database first.")
else:
    # Example Python objects
    X_sample = [[21], [22], [23], [24]]
    y_sample = [2, 4.1, 5.9, 8.1] # Slightly noisy data
    py_model = LinearRegression().fit(X_sample, y_sample)
    model_description = "Simple sklearn linear regression model"
    datastore_name = "my_sklearn_models_ds"
    df_sample = pd.DataFrame({'col1': [25, 26], 'col2':})
    objects_to_save = {
        'simple_lr_model': py_model,
        'sample_dataframe': df_sample,
        'info_dict': {'version': '1.0', 'author': 'OML User'}
    }

    # Save objects to datastore
    try:
        oml.ds.save(objs=objects_to_save,
                    name=datastore_name,
                    description=model_description,
                    overwrite=True) # Overwrite if exists
        print(f"Objects saved to datastore '{datastore_name}'.")

        # List available datastores
        print("\nAvailable datastores:")
        print(oml.ds.dir())

        # List objects within the datastore
        print(f"\nObjects in '{datastore_name}':")
        print(oml.ds.dir(datastore_name))

        # Load all objects from the datastore
        loaded_objs = oml.ds.load(name=datastore_name)
        loaded_model = loaded_objs['simple_lr_model']
        loaded_df = loaded_objs['sample_dataframe']
        loaded_info = loaded_objs['info_dict']
        print(f"\nAll objects loaded from '{datastore_name}'.")
        print(f"Loaded model type: {type(loaded_model)}")
        print(f"Loaded DataFrame:\n{loaded_df}")
        print(f"Loaded Info: {loaded_info}")

        # Load only a specific object
        specific_load = oml.ds.load(name=datastore_name, objs=['simple_lr_model'])
        print(f"\nLoaded only 'simple_lr_model', type: {type(specific_load['simple_lr_model'])}")

        # Delete the datastore
        oml.ds.delete(name=datastore_name)
        print(f"\nDatastore '{datastore_name}' deleted.")

        # Verify deletion
        print("\nAvailable datastores after deletion:")
        print(oml.ds.dir())

    except Exception as e:
        print(f"\nDatastore error: {e}")
        print("Note: Ensure necessary grants for datastore operations.")

Table 3.1: Key OML4Py Functions and Purposes

CategoryFunctionPurposeExample Usage SnippetRelevant Snippet(s)
Connectionoml.connect()Establishes connection from Python session to Oracle DB.oml.connect(user='u', password='p', dsn='alias')
Connectionoml.disconnect()Terminates the active database connection.oml.disconnect()
Connectionoml.isconnected()Checks if an active connection exists.if oml.isconnected(): print('Connected')
Data Transferoml.push()Loads Python data (DataFrame, list) to a temporary DB table.oml_temp = oml.push(pandas_df)
Data Transferoml.create()Loads Python data (DataFrame, list) to a persistent DB table.oml.create(pandas_df, table='MY_DATA')
Data Transferoml.sync()Creates OML DataFrame proxy for existing DB table/view/query.oml_proxy = oml.sync(table='EXISTING_TABLE')
Data Transferoml.drop()Drops specified DB table, view, or model.oml.drop(table='OLD_TABLE')
Modelingoml.<algo_name>()Creates model object for specified in-database ML algorithm.dt_model = oml.dt()
Modelingmodel.fit(X, y)Trains model in DB using OML DataFrame proxies (X, y).dt_model.fit(train_X, train_y, case_id='ID')
Modelingmodel.predict(X)Makes predictions in DB on new data (X proxy) using trained model.predictions = dt_model.predict(test_X)
Modelingmodel.predict_proba(X)Makes probability predictions in DB for classification models.probabilities = dt_model.predict_proba(test_X)
EPEoml.do_eval()Runs simple Python UDF in DB, returns result to Python.result = oml.do_eval(func=my_func, arg1=val)
EPEoml.table_apply()Runs UDF taking/returning DataFrame in DB, writes result to DB.res_oml = oml.table_apply(oml_df, func=process_df)
EPEoml.group_apply()Runs UDF on data groups in DB.res_oml = oml.group_apply(oml_df, index='GROUP_COL', func=grp_func)
EPEoml.row_apply()Runs UDF on rows/row chunks in parallel in DB.res_oml = oml.row_apply(oml_df, rows=10, func=row_func)
EPEoml.script.create()Saves Python function to DB script repository.oml.script.create('my_script', func_string)
EPEoml.script.run()Executes saved Python script in DB.oml_df.assign(new_col=oml.script.run(args={}, func='my_script'))
Managementoml.ds.save()Stores Python objects in DB datastore.oml.ds.save(objs={'m': model}, name='ds1')
Managementoml.ds.load()Loads stored Python objects from datastore.loaded = oml.ds.load(name='ds1')
Managementoml.ds.delete()Deletes specified datastore.oml.ds.delete(name='ds1')
Managementoml.ds.dir()Lists available datastores or objects within one.oml.ds.dir() or oml.ds.dir('ds1')

4. Business Value: Benefits and Industry Use Cases

OML4Py 2.1 delivers significant business value beyond its technical capabilities, enabling companies to derive more value from their data.

4.1. Core Enterprise Benefits

4.2. Example Business Scenarios

These scenarios highlight how OML4Py leverages existing Oracle DB investments (storage, security, HA, backup/recovery) for ML workloads, reducing the need for separate, potentially costly and complex ML infrastructures and fostering a more integrated, efficient data strategy.

5. Installation and Setup: Prerequisites and Dependencies

Getting started with OML4Py 2.1 requires meeting specific prerequisites on both the client (where Python runs) and server (Oracle Database) sides.

5.1. Required Python Version and Installation

5.2. OML4Py Library Installation

5.3. Oracle Client Requirements

5.4. Supported Oracle Database Releases and Editions

Table 5.1: OML4Py 2.1 Installation Checklist (Example On-Premises/Linux)

StepRequirement/ActionDetails/NotesRelevant Snippet(s)Status
1Supported OSLinux x86-64 (OL7, OL8, etc.)[ ]
2Supported DB VersionOracle Database 23ai (EE or compatible)[ ]
3Python VersionInstall Python 3.12.x (compile with --enable-shared)[ ]
4Required Linux PackagesInstall perl-Env, libffi-devel, openssl-devel, gcc-c++, etc. (yum/dnf)[ ]
5Oracle ClientInstall Instant Client (Basic/Basic Light)[ ]
6Environment Variables (Client & Server)Set ORACLE_HOME, PYTHONHOME, PATH, LD_LIBRARY_PATH, TNS_ADMIN[ ]
7OML4Py Server Install (on DB Server)Run SQL scripts (pyqcfg.sql) from package as sysdba[ ]
8DB User GrantsGrant OML_DEVELOPER role and other necessary privileges[ ]
9Required Python Packages (Client & Server)Install packages from requirements.txt / requirements2.txt via pip[ ]
10OML4Py Client Install (on Client Machine)Run Perl script (client.pl) from package or use pip[ ]
11Connection TestAttempt oml.connect() from Python client[ ]
12EPE Setup Verification (Optional)Run oml.check_embed() function[ ]

6. Overall Assessment and Conclusion

Oracle Machine Learning for Python (OML4Py) 2.1 represents a significant step in Oracle’s vision of integrating machine learning and AI capabilities directly into its core database platform. It uniquely combines the popularity and power of the Python data science ecosystem with the enterprise-grade security, scalability, and performance of the Oracle Database.

OML4Py 2.1’s Place in the Oracle Ecosystem: OML4Py is a cornerstone of Oracle’s strategy to position the database not just for data storage, but as a central platform for advanced analytics and AI/ML workloads. Its tight integration with Oracle Database 23ai features like AI Vector Search demonstrates its critical role for both traditional ML tasks and modern semantic search and Generative AI applications. Alongside other OML components (OML4SQL, OML4R, OML Services, OML Notebooks), it provides a comprehensive and consistent machine learning experience on the Oracle database.  

Value for Data Scientists and Developers: OML4Py’s greatest value lies in empowering Python users with familiar tools (Python language, Pandas-like API, scikit-learn-like modeling interface) while granting access to Oracle Database’s power. By eliminating time-consuming and error-prone data movement and transformation steps, it allows data scientists and developers to focus directly on modeling and analysis. In-database algorithms offer high performance, while EPE provides the flexibility to use the full Python ecosystem when needed. Innovations like ONNX and Hugging Face support enable the secure use of cutting-edge open-source models and industry standards within the enterprise environment.  

Contribution to Data-Driven Decision Making: OML4Py helps companies derive insights from their data faster and more securely, translating them into action:

Conclusion: OML4Py 2.1 is a mature, capable platform merging Python’s flexibility with Oracle Database’s scalability and security. Enhancements like VECTOR data type support and advanced ONNX/Hugging Face integration make it particularly compelling for modern AI/ML workloads on Oracle Database 23ai. While installation and dependencies require attention, the performance, security, and efficiency benefits gained by eliminating data movement position OML4Py as a strategic tool for accelerating data-driven decisions and fostering innovation within the Oracle ecosystem. Companies can leverage their existing Oracle investments with OML4Py to unlock greater value from their data.

Exit mobile version