.. _batchstmnt:

******************************************
Batch Statement Execution and Bulk Loading
******************************************

.. note::

    **cx_Oracle has a major new release under a new name and homepage**
    `python-oracledb <https://oracle.github.io/python-oracledb/>`__.

    **New projects should install python-oracledb instead of cx_Oracle.**

Inserting or updating multiple rows can be performed efficiently with
:meth:`Cursor.executemany()`, making it easy to work with large data sets with
cx_Oracle. This method can significantly outperform repeated calls to
:meth:`Cursor.execute()` by reducing network transfer costs and database
overheads. The :meth:`~Cursor.executemany()` method can also be used to
execute PL/SQL statements multiple times at once.

There are examples in the `GitHub examples
<https://github.com/oracle/python-cx_Oracle/tree/main/samples>`__
directory.

The following tables are used in the samples that follow:

.. code-block:: sql

    create table ParentTable (
        ParentId     number(9) not null,
        Description  varchar2(60) not null,
        constraint ParentTable_pk primary key (ParentId)
    );

    create table ChildTable (
        ChildId      number(9) not null,
        ParentId     number(9) not null,
        Description  varchar2(60) not null,
        constraint ChildTable_pk primary key (ChildId),
        constraint ChildTable_fk foreign key (ParentId)
                references ParentTable
    );

Batch Execution of SQL
======================

The following example inserts five rows into the table ``ParentTable``:

.. code-block:: python

    data = [
        (10, 'Parent 10'),
        (20, 'Parent 20'),
        (30, 'Parent 30'),
        (40, 'Parent 40'),
        (50, 'Parent 50')
    ]
    cursor.executemany("insert into ParentTable values (:1, :2)", data)

This code requires only one :ref:`round-trip <roundtrips>` from the client to
the database instead of the five round-trips that would be required for
repeated calls to :meth:`~Cursor.execute()`. For very large data sets there
may be external buffer or network limits to how many rows can be processed at
once, so repeated calls to ``executemany()`` may be required. The limits are
based on both the number of rows being processed and the "size" of each row
being processed. Repeated calls to :meth:`~Cursor.executemany()` are still
better than repeated calls to :meth:`~Cursor.execute()`.

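A minimal sketch of such chunking, assuming ``data`` is a large list of rows
and ``con`` is the established connection (the batch size of 10,000 is an
arbitrary illustration):

.. code-block:: python

    # Insert "data" in batches small enough to fit buffer and network
    # limits. The batch size below is illustrative; tune it for your
    # environment.
    batch_size = 10000
    sql = "insert into ParentTable values (:1, :2)"
    for start in range(0, len(data), batch_size):
        cursor.executemany(sql, data[start:start + batch_size])
    con.commit()
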
Batch Execution of PL/SQL
=========================

PL/SQL functions and procedures and anonymous PL/SQL blocks can also be called
using :meth:`~Cursor.executemany()` in order to improve performance. For
example:

.. code-block:: python

    data = [
        (10, 'Parent 10'),
        (20, 'Parent 20'),
        (30, 'Parent 30'),
        (40, 'Parent 40'),
        (50, 'Parent 50')
    ]
    cursor.executemany("begin mypkg.create_parent(:1, :2); end;", data)

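The package ``mypkg`` is assumed to already exist in your schema. For
illustration only, a minimal definition that would work with the call above
could be created from Python like this (the real package may differ):

.. code-block:: python

    # Illustrative only: one possible definition of mypkg.
    cursor.execute("""
        create or replace package mypkg as
            procedure create_parent(p_id in number,
                                    p_description in varchar2);
        end;""")
    cursor.execute("""
        create or replace package body mypkg as
            procedure create_parent(p_id in number,
                                    p_description in varchar2) is
            begin
                insert into ParentTable (ParentId, Description)
                values (p_id, p_description);
            end;
        end;""")
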
Note that the ``batcherrors`` parameter (discussed below) cannot be used with
PL/SQL block execution.

Handling Data Errors
====================

Large data sets may contain some invalid data. When using batch execution as
discussed above, the entire batch will be discarded if a single error is
detected, potentially eliminating the performance benefits of batch execution
and increasing the complexity of the code required to handle those errors. If
the parameter ``batcherrors`` is set to the value ``True`` when calling
:meth:`~Cursor.executemany()`, however, processing will continue even if there
are data errors in some rows, and the rows containing errors can be examined
afterwards to determine what course the application should take. Note that if
any errors are detected, a transaction will be started but not committed, even
if :attr:`Connection.autocommit` is set to ``True``. After examining the errors
and deciding what to do with them, the application needs to explicitly commit
or roll back the transaction with :meth:`Connection.commit()` or
:meth:`Connection.rollback()`, as needed.

This example shows how data errors can be identified:

.. code-block:: python

    data = [
        (60, 'Parent 60'),
        (70, 'Parent 70'),
        (70, 'Parent 70 (duplicate)'),
        (80, 'Parent 80'),
        (80, 'Parent 80 (duplicate)'),
        (90, 'Parent 90')
    ]
    cursor.executemany("insert into ParentTable values (:1, :2)", data,
                       batcherrors=True)
    for error in cursor.getbatcherrors():
        print("Error", error.message, "at row offset", error.offset)

The output is::

    Error ORA-00001: unique constraint (PYTHONDEMO.PARENTTABLE_PK) violated at row offset 2
    Error ORA-00001: unique constraint (PYTHONDEMO.PARENTTABLE_PK) violated at row offset 4

The row offset is the index into the array of the data that could not be
inserted due to errors. The application could choose to commit or roll back
the other rows that were successfully inserted. Alternatively, it could
correct the data for the two invalid rows and attempt to insert them again
before committing.

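A sketch of the retry approach, assuming the application is able to correct
the failing rows (the ``fix_rows()`` helper is hypothetical):

.. code-block:: python

    # Collect the rows that failed, correct them in application code,
    # and try them again before committing.
    failed_rows = [data[error.offset] for error in cursor.getbatcherrors()]
    corrected_rows = fix_rows(failed_rows)  # hypothetical application logic
    cursor.executemany("insert into ParentTable values (:1, :2)",
                       corrected_rows, batcherrors=True)
    if cursor.getbatcherrors():
        con.rollback()
    else:
        con.commit()
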
Identifying Affected Rows
=========================

When executing a DML statement using :meth:`~Cursor.execute()`, the number of
rows affected can be examined by looking at the attribute
:attr:`~Cursor.rowcount`. When performing batch execution with
:meth:`Cursor.executemany()`, however, the row count will return the *total*
number of rows that were affected. If you want to know the number of rows
affected by each row of data that is bound, you must set the parameter
``arraydmlrowcounts`` to ``True``, as shown:

.. code-block:: python

    parent_ids_to_delete = [20, 30, 50]
    cursor.executemany("delete from ChildTable where ParentId = :1",
                       [(i,) for i in parent_ids_to_delete],
                       arraydmlrowcounts=True)
    row_counts = cursor.getarraydmlrowcounts()
    for parent_id, count in zip(parent_ids_to_delete, row_counts):
        print("Parent ID:", parent_id, "deleted", count, "rows.")

Using the data found in the `GitHub samples
<https://github.com/oracle/python-cx_Oracle/tree/main/samples>`__ the output
is as follows::

    Parent ID: 20 deleted 3 rows.
    Parent ID: 30 deleted 2 rows.
    Parent ID: 50 deleted 4 rows.

DML RETURNING
=============

DML statements like INSERT, UPDATE, DELETE and MERGE can return values by using
the DML RETURNING syntax. A bind variable can be created to accept this data.
See :ref:`bind` for more information.

If, instead of merely deleting the rows as shown in the previous example, you
also wanted to know some information about each of the rows that were deleted,
you could use the following code:

.. code-block:: python

    parent_ids_to_delete = [20, 30, 50]
    child_id_var = cursor.var(int, arraysize=len(parent_ids_to_delete))
    cursor.setinputsizes(None, child_id_var)
    cursor.executemany("""
            delete from ChildTable
            where ParentId = :1
            returning ChildId into :2""",
            [(i,) for i in parent_ids_to_delete])
    for ix, parent_id in enumerate(parent_ids_to_delete):
        print("Child IDs deleted for parent ID", parent_id, "are",
              child_id_var.getvalue(ix))

The output would then be::

    Child IDs deleted for parent ID 20 are [1002, 1003, 1004]
    Child IDs deleted for parent ID 30 are [1005, 1006]
    Child IDs deleted for parent ID 50 are [1012, 1013, 1014, 1015]

Note that the bind variable created to accept the returned data must have an
arraysize large enough to hold data for each row that is processed. Also,
the call to :meth:`Cursor.setinputsizes()` binds this variable immediately so
that it does not have to be passed in each row of data.

Predefining Memory Areas
========================

When multiple rows of data are being processed there is the possibility that
the data is not uniform in type and size. In such cases, cx_Oracle makes some
effort to accommodate such differences. Type determination for each column is
deferred until a value that is not ``None`` is found in the column's data. If
all values in a particular column are ``None``, then cx_Oracle assumes the type
is a string and has a length of 1. cx_Oracle will also adjust the size of the
buffers used to store strings and bytes when a longer value is encountered in
the data. These sorts of operations incur overhead as memory has to be
reallocated and data copied. To eliminate this overhead, using
:meth:`~Cursor.setinputsizes()` tells cx_Oracle about the type and size of the
data that is going to be used.

Consider the following code:

.. code-block:: python

    data = [
        (110, "Parent 110"),
        (2000, "Parent 2000"),
        (30000, "Parent 30000"),
        (400000, "Parent 400000"),
        (5000000, "Parent 5000000")
    ]
    cursor.setinputsizes(None, 20)
    cursor.executemany("""
            insert into ParentTable (ParentId, Description)
            values (:1, :2)""", data)

In this example, without the call to :meth:`~Cursor.setinputsizes()`, cx_Oracle
would perform five allocations of increasing size as it discovered each new,
longer string. However, ``cursor.setinputsizes(None, 20)`` tells cx_Oracle that
the maximum size of the strings that will be processed is 20 characters. Since
cx_Oracle allocates memory for each row based on this value, it is best not to
oversize it. The first parameter of ``None`` tells cx_Oracle that its default
processing will be sufficient.

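Instead of a size, a type accepted by :meth:`~Cursor.var()` can also be passed
to :meth:`~Cursor.setinputsizes()`. This declares the column's type up front,
which is particularly useful when a column contains only ``None`` values,
since cx_Oracle would otherwise assume a one-character string. A minimal
sketch, with illustrative data:

.. code-block:: python

    # Predefine both bind variables: the first as an integer, the second
    # as a string of at most 60 characters, matching the column sizes.
    data = [
        (6000000, "Parent 6000000"),
        (7000000, "Parent 7000000")
    ]
    cursor.setinputsizes(int, 60)
    cursor.executemany("""
            insert into ParentTable (ParentId, Description)
            values (:1, :2)""", data)
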
Loading CSV Files into Oracle Database
======================================

The :meth:`Cursor.executemany()` method and Python's `csv module
<https://docs.python.org/3/library/csv.html#module-csv>`__ can be used to
efficiently load CSV (Comma Separated Values) files. For example, consider the
file ``data.csv``::

    101,Abel
    154,Baker
    132,Charlie
    199,Delta
    . . .

And the schema:

.. code-block:: sql

    create table test (id number, name varchar2(25));

Data loading can be done in batches of records, since the total number of
records may prevent all of the data being inserted at once:

.. code-block:: python

    import csv

    import cx_Oracle

    # This example assumes a connection has already been created, for
    # example:
    # con = cx_Oracle.connect(user="hr", password=userpwd,
    #                         dsn="dbhost.example.com/orclpdb1")
    cursor = con.cursor()

    # Predefine the memory areas to match the table definition.
    # This can improve performance by avoiding memory reallocations.
    # Here, one parameter is passed for each of the columns.
    # "None" is used for the ID column, since the size of NUMBER isn't
    # variable. The "25" matches the maximum expected data size for the
    # NAME column.
    cursor.setinputsizes(None, 25)

    # Adjust the number of rows to be inserted in each iteration
    # to meet your memory and performance requirements.
    batch_size = 10000

    with open('data.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into test (id, name) values (:1, :2)"
        data = []
        for line in csv_reader:
            data.append((line[0], line[1]))
            if len(data) == batch_size:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        con.commit()

Depending on data sizes and business requirements, database changes such as
temporarily disabling redo logging on the table, or disabling indexes, may
also be beneficial.

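A hypothetical sketch of such changes around a load is shown below. The index
name is invented for illustration, and ``NOLOGGING`` has recoverability
implications that should be reviewed before use:

.. code-block:: python

    # Hypothetical: reduce redo generation and index maintenance during
    # the load, then restore normal settings afterwards.
    cursor.execute("alter table test nologging")
    cursor.execute("alter index test_name_ix unusable")  # invented index name

    # ... perform the bulk load shown above ...

    cursor.execute("alter table test logging")
    cursor.execute("alter index test_name_ix rebuild")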