Overview
This version of aCOPY FROM
statement, allows you to specify the imported file format and currently three types are supported.
With each file type there will be differences in performance and behavior
- CSV (comma-separated values): simple columnar text format
- ORC (optimized row columnar): columnar storage format developed by Apache
- Parquet (Apache Parquet): columnar data storage used in Apache Hadoop ecosystem
Each query without a specified file format assumes to be importing a CSV file (There is no format detection in place)
Syntax
In order to sepcify the file format using theCOPY FROM
statement you can use the following syntax:
Format name is case insensitive
Examples
When copying from the CSV, ORC or Parquet formats, as the first step you need to create a destination table:ORC
Copying a dataset from an ORC format file is only supported if the ORC file resides in an object storage solution, such as an S3 bucket.- Use
HEADER
if your ORC file includes column headers
Parquet
Copying a dataset from a parquet format file is only supported if the parquet file resides in an object storage solution, such as an S3 bucket.Copying from Parquet files is memory consuming. Files bigger than a few gigabytes might result in out of memory error
CSV
When it comes to CSV files, we have a few cases here:- CSV
- CSV with Specified Delimiter
- CSV skipping invalid rows
on_error
action to ignore
enforce further processing after skipping invalid rows.
Differences in Behavior
-
Ignored Options
HEADER
,DELIMITER
,NULL
,ON_ERROR
options are ignored not affecting the execution of the queries for formats different than CSV
-
Null Values Handling
- All ORC files have nullable columns. In order to import a nullable column to an Oxla column, which are described as
NOT NULL
, the column in the ORC file cannot contain a null value, otherwise the request will be terminated. - For Parquet files, inserting a nullable column to a non nullable column is allowed as long as there are no null values in the source column.
- When using
ON_ERROR
option for CSV files, null constraint violation causes row skipping
- All ORC files have nullable columns. In order to import a nullable column to an Oxla column, which are described as
-
Column Matching
- ORC and Parquet files are only being matched based on column index, while CSVs can be matched both with names or indexes. For more information regarding that, please refer to our COPY FROM with HEADER doc