MTU lab

Loading CSV with newline in the string field (2/2)

Preparation

  1. In order to build our CSV convertion program, prepare Microsoft Visual C++ 2013 or later products. If you use Visual C++ Community Edition you can get it from here.
  2. Download the archive file containing the conversion program from here and unzip it to an arbitrary folder.
  3. NOTE:The copyright of the software provided in this article is owned by PLUMSIX Co., Ltd. When dealing with this software you shall comply with the provisions of MIT License.
  4. Run “Visual Studio Command Prompt” included in Visual Studio’s product menu and move to the location where you unzipped the conversion program.
  5. cd /d path_to_program_folder
  6. From the command prompt, issue the nmake command to build the program.
  7. nmake EXE="addlength.exe" OBJS="addlength.obj" CPPFLAGS="/nologo /EHsc /Zi /O2"
  8. Below is a message example when you succeeded in building.
  9. c:\github\addlength>nmake EXE="addlength.exe" OBJS="addlength.obj" CPPFLAGS="/nologo /EHsc /Zi /O2"
    
    Microsoft (R) Program Maintenance Utility Version 10.00.40219.01
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
            cl /nologo /O2 /DNDEBUG /c addlength.cpp
    addlength.cpp
            cl /nologo /O2 /DNDEBUG /Feaddlength.exe addlength.obj
  10. Now you are ready.

Execution

  1. Enter the following command at the command prompt to run the program.
  2. addlength src_file dst_file
    Here, src_file shows the file before conversion and dst_file shows the file after conversion. You can add a folder name before the file name. If you omit the folder name, the current folder will be converted.
  3. It is a case where we actually tried the conversion.
  4. c:\github\addlength>addlength LOAD_SAMPLE_en.csv LOAD_SAMPLE.dat
    Infomation - Completed successfully. Wrote 3 records and 130 bytes in 0.00 seconds.
    
    c:\github\addlength>type LOAD_SAMPLE_en.csv
    "2012/10/21","Newline_is_not_in",85
    "2012/11/21","Line_
    break_is_before_the_break",86
    "2012/12/21","Line_break_is
    _after_the_is",87
    
    c:\github\addlength>type LOAD_SAMPLE.dat
    0000000038"2012/10/21","Newline_is_not_in",85,
    0000000053"2012/11/21","Line_
    break_is_before_the_break",86,
    0000000049"2012/12/21","Line_break_is
    _after_the_is",87,
    
  5. Since the record format has been changed from stream format to variable record format, rewritte the control file accordingly. Change the data file name to LOAD_SAMPLE.dat as well.
  6. c:\github\addlength>type LOAD_SAMPLE.ctl
    LOAD INFILE "LOAD_SAMPLE.dat" "VAR 10"
    INTO TABLE "LOAD_SAMPLE"
    TRUNCATE FIELDS TERMINATED BY ','
    TRAILING NULLCOLS
    (  "C1" DATE "YYYY/MM/DD" ENCLOSED BY '"'
    ,  "C2" CHAR ENCLOSED BY '"'
    ,  "C3" DECIMAL EXTERNAL
    )
    
  7. Let’s see if you can load the converted data using SQL*Loader.
  8. c:\github\addlength>sqlldr SCOTT/TIGER control=LOAD_SAMPLE
    
    SQL*Loader: Release 11.2.0.2.0 - Production on Tue Sep 4 09:37:24 2018
    
    Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
    
    Commit point reached - logical record count 3
    
  9. Let’s check with SQL*Plus whether data has been saved in the table.
  10. c:\github\addlength>sqlplus SCOTT/TIGER
    
    SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 4 09:38:44 2018
    
    Copyright (c) 1982, 2010, Oracle.  All rights reserved.
    
    
    Connected to:
    Oracle Database 11g Express Edition Release 11.2.0.2.0 - Production
    
    SQL> SET PAGESIZE 1000
    SQL> COLUMN C2 FORMAT A10
    SQL> select * from LOAD_SAMPLE;
    
    C1                 C2                 C3
    ------------------ ---------- ----------
    21-OCT-12          Newline_is         85
                       _not_in
    
    21-NOV-12          Line_              86
                       break_is_b
                       efore_the_
                       break
    
    21-DEC-12          Line_break         87
                       _is
                       _after_the
                       _is
    
    SQL>
    

    You succeeded in loading.

Conclusion

Through this topic you can see the following.

  1. Of the record formats supported by SQL*Loader, stream record format can not be loaded properly if newline char is included in the string type field. Moreover, this is the initial default value.
  2. To load data without re-creating the CSV, it is necessary to convert the data to the variable record format.
  3. For each record, make sure that the record length of n bytes fixed width and the comma character is placed at the end (just before CF + LF) for the data conversion result. When loading a large amount of data it is necessary to help conversion program to reduce time and effort.
  4. Also modify the control file. Add the option “VAR n” to the INFILE clause.

In addition, the provided data conversion program:addlength.cpp is compatible (automatic identification) to UTF-8 in addition to Shift-JIS and has performance that can withstand practical use even for large amounts of data.
The results of investigating the performance were as follows. The hardware used is the same as that used in the product introduction test.

Index Number
Number of lines 100 million
Number of bytes 4.7GB
Time required 95 seconds
Data rate 50.1MBytes/second
c:\github\addlength>addlength output\F1.dat output\F1.csv
Infomation - Completed successfully. Wrote 100000000 records and 5161887577 bytes in 95.04 seconds.

The above is a hard copy of the test screen.

By the way, in our product MTU, if you know that there are string fields including newlines beforehand, you can get the variable record format control file as shown in this topic and CSV data containing the record length can be output.
Since it is often unknown whether a line feed is included in the string field before unloading, the setting to output in variable record format instead of stream record format is set as the factory default value. In addition, fixed record format is also selectable.
The environment variable that controls this behavior is RECLENGTH. It is defined in the env.bat file.

by 開発1号

See previous Loading CSV with newline in the string field (1/2)