MTU lab

Loading CSV with newline in the string field (1/2)

Introduction

In this topic, we will explain the correct way to load CSV with newline (CR+LF) in the string field into Oracle Database. Even if CSV file that includes a hundred million rows, you can be downloaded a source code of the program which can convert it to loadable format in only 95 seconds.

Motivation

To load CSV data into Oracle Database, it is well used an utility, called “SQL*Loader”.
SQL*Loader is excellent to quickly load a large amount of data, But unexpectedly it can not read correctly the CSV with newline in the string field. Even though it is enclosed by two double quotations, it’s not possible.

You might be a sufferer who did not know it is the fact and encountered a mountain of BAD files and error messages at the stage of loading data, finally you may have arrived here seeking relief at the end of Googling.

It is goal of this topic that provides a helpful solution for the readers by passing the CSV through the conversion program without unloading it again.

Reproduction of case

Let’s reproduce under what circumstances the failure occurs.
For example, suppose you have the following CSV data: Let’s name it LOAD_SAMPLE.csv.

"2012/10/21","Newline_is_not_in",85
"2012/11/21","Line_
break_is_before_the_break",86 "2012/12/21","Line_break_is
_after_the_is",87

There are three records and each has three fields that are date, string and number.
Second field as an string on second and third row has newline(exactly CR+LF).
Prepare a table for storing this CSV data in the SCOTT schema. Use the following SQL statement:

CREATE TABLE load_sample ( c1 DATE, c2 VARCHAR2(36), c3 NUMBER(5));

Using following insert statements, you can prepare to the load_sample table the same as LOAD_SAMPLE.csv.

insert into load_sample values ('12-10-21','Newline_is_not_in',85);
insert into load_sample values ('12-11-21','Line_'||chr(13)||chr(10)||'break_is_before_the_break',86);
insert into load_sample values ('12-11-21','Line_break_is'||chr(13)||chr(10)||'_after_the_is',87);
commit;

Create a control file for importing CSV data as follows. Let’s name it LOAD_SAMPLE.ctl

LOAD INFILE "LOAD_SAMPLE.csv"
INTO TABLE "LOAD_SAMPLE"
TRUNCATE FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(  "C1" DATE "YYYY/MM/DD" ENCLOSED BY '"'
,  "C2" CHAR ENCLOSED BY '"'
,  "C3" DECIMAL EXTERNAL
)

Invoke SQL*Loader as follows to capture CSV data into the LOAD_SAMPLE table.

c:\output>sqlldr SCOTT/TIGER@sm201504-001h_mtu01 control=LOAD_SAMPLE

SQL*Loader: Release 11.2.0.2.0 - Production on Mon Sep 3 14:27:40 2018

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Commit point reached - logical record count 5

The above message looks like no problem at first glance. But would you not notice something strange?
Good insight. Although there were only three CSVs, it is strange that logical record count 5 on the screen. Let's see what happens to the log file.
c:\output>type LOAD_SAMPLE.log

SQL*Loader: Release 11.2.0.2.0 - Production on Mon Sep 3 14:27:40 2018

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

#### Abbreviated the middle ####


   Column Name                  Position   Len  Term Encl Datatype
------------------------------ ---------- ----- ---- ---- ---------------------
"C1"                                FIRST     *   ,    "  DATE YYYY/MM/DD
"C2"                                 NEXT     *   ,    "  CHARACTER
"C3"                                 NEXT     *   ,       CHARACTER

Record 2: Rejected - Error on table "LOAD_SAMPLE", column "C2".
second enclosure string not present
Record 3: Rejected - Error on table "LOAD_SAMPLE", column "C1".
Initial enclosure character not found
Record 4: Rejected - Error on table "LOAD_SAMPLE", column "C2".
second enclosure string not present
Record 5: Rejected - Error on table "LOAD_SAMPLE", column "C1".
Initial enclosure character not found

Table "LOAD_SAMPLE":
  1 Row successfully loaded.
  4 Rows not loaded due to data errors.
  0 Rows not loaded because all WHEN clauses were failed.
  0 Rows not loaded because all fields were null.


Space allocated for bind array:                  49536 bytes(64 rows)
Read   buffer bytes: 1048576

Total logical records skipped:          0
Total logical records read:             5
Total logical records rejected:         4
Total logical records discarded:        0

Run began on Mon Sep 03 14:27:40 2018
Run ended on Mon Sep 03 14:27:41 2018

Elapsed time was:     00:00:00.42
CPU time was:         00:00:00.01

After all, it seems that there was a problem.

Record xx: Rejected - Error on table "LOAD_SAMPLE", column "xx".

Above message was appeared four times, and it turned out that the load was not done correctly.
In the first error message

second enclosure string not present

is shows that although starting to read the string field of the second record, it seems that the enclosing string (double quotation mark) indicating the end of the field was not found.
In other words, newlines in this string field were not part of the data, they were recognized as delimiters separating records.

Strategy review

So is it impossible to load this CSV using SQL*Loader?
No, it is possible. You can load the program by making the program in the next article, converting the CSV format, and adding some rework to the control file.

Before actually creating the program, let's confirm the concept of the solution.
In the Oracle product manual Oracle Database Utilities, there is explain usage of SQL*Loader.
The chapter SQL*Loader Concepts describes the format of records recognized by SQL*Loader. There are three record formats that can be handled.

Record format name Specification on control file
Fixed Record Format INFILE datafile_name "fix n"
Variable Record Format INFILE "datafile_name" "var n"
Stream Record Format INFILE datafile_name ["str terminator_string"]

If the next symbol of datafile_name is omitted (That is, the above error occurred, the case of the control file: LOAD_SAMPLE.ctl), Stream Record Format is selected, on Windows platforms it is "\r\n" or "\r" is selected as terminator_string.

Therefore, you can understand it is impossible using the default stream format in the case of loading CSV with newline in the string field.

To load with conversion only without re-outputting CSV, we will use Variable Record Format. This format identifies the end of the record based on the length of the record contained in the record itself, not CR+LF, so even if CR+LF exists in the middle it will not treat it as a record delimiter.

The figure below shows the control file for dealing with the variable record format and the sample of the record matching it.

For the control file to use the variable record format, put the keyword VAR followed by the number of decimal digits. Define the number of digits assuming the maximum record length. By choosing 10 as shown in the figure, you can correspond to the record length up to the 10th power of 10 bytes.
To represent an actual record in the variable record format, prepare a field of the number of digits specified by the argument of the VAR keyword before the record body, and put the record length in decimal (fill with 0 except valid digits ). It is important to include the number of bytes of CR+LF at the end of the line in the record body as well.
If there is no comma before CR+LF, add a comma for one character.

By adding the changes explained up to this point to the control file and CSV file, you can load the data correctly. Although you can easily modify the control file manually, CSV file must repeat the process of counting record length and adding 10 digit number string to the head of each records. In order to automate this digging work, we provided a conversion program.

See next Loading CSV with newline in the string field (2/2).