Running a Bulk Load

This section summarizes the steps required to run a bulk load into a Yellowbrick table. For more details, see ybload Command and ybload Options.

To run a bulk load:

  1. Make sure the target Yellowbrick database is up and running and the ybload client is installed (as part of the ybtools package).
  2. Create the table in the database before running the load operation. Use the CREATE TABLE statement (or CREATE TABLE AS or SELECT INTO).
  3. Determine the exact location and format of the source file (or set of source files) for the load. The following source types are supported:
    • Files local to the client system where you are running ybload
    • stdin
    • Named pipes
    • Amazon Simple Storage Service (Amazon S3) files (See Loading a Table from Amazon S3.)
    Note: Try to export the data from the source system in the character set that you used when you created the Yellowbrick database. For example, load exported UTF8 data into a UTF8 database. Any required character set translation is likely to have an impact on import performance during the load.
  4. From a client system where the ybtools clients are installed, run the ybload command, as a user account that is not a superuser. For example, the default yellowbrick user is a superuser. Superuser accounts may not have sufficient memory to run bulk load operations.
    You must define a few essential pieces of information. On an as-needed basis, you can also specify a range of other processing, logging, and parsing options.
    ybload [options] -t [SCHEMA.]TABLENAME SOURCE [SOURCE]...
    • [options]: You will need to set some database connection values if they are not already set with environment variables. Other options depend on the nature of the data and related formatting requirements for the load. Consider using the --dryrun option for a new load operation; this will help you determine the options you need in order to load the table successfully. For details, see ybload Options or the online help text.
      Important: Setting the --format option is strongly recommended for all load operations.
    • TABLENAME: the target table, including its schema name (optional but recommended). If you do not specify the schema name, the table is assumed to be in the public schema. The schema of the target table is not based on the user's search_path, regardless of how it is set.
      Important: The user who runs the load must have INSERT permissions on the table (but does not have to own the table).
    • SOURCEFILE: one or more source files and their path (locally or on an S3 site). Each source file should be a flat file with a consistent field delimiter (comma-separated or tab-delimited, for example). Use spaces to separate multiple file references. You can also use wildcard characters (such as *) to refer to multiple files or a whole directory of files. Single quotes around source file names are optional.
      Compressed gzip, bzip2, xz, pack200, and lz4 formats may be detected based on their file extensions:
      • .gz = gzip
      • .bzip2 or .bz2 = bzip
      • .xz = xz
      • .pack200 = pack200
      • .lz4 = lz4
      ybload can also detect compression headers within regular compressed files, but the file extension mapping overrides this header detection. The --source-compression option, if specified, overrides both of these detection methods.
      Note: lz4 compression is not as compact as gzip but takes less time to both compress and decompress.

      At a minimum, each source file should contain fields for every non-nullable column in the table. Files do not have to match the target table exactly, one to one, in terms of fields mapping to columns and the order of the fields. Options are available to work around these mismatches. If the fields in your source file do not align with the columns in the target table, use either --field-names or --parse-header-line to map fields to columns.

      Note: To load columns with DEFAULT values, use the --field-names option and omit those columns from the list. See ybload Options.

      Empty source files do not prevent bulk loads from proceeding. If ybload encounters an empty source file, the load operation skips to the next file, if any.

    • To load from stdin (standard input, an unnamed pipe), enter the single-dash character (-) at the end of the ybload command. This must be the last character on the command line, and it must be prefixed with -- to indicate that option parsing is complete. For example: ./ybload -d premdb --username bobr -t match -- -

    • If your workflow involves multiple processes communicating through named pipes, ybload can consume directly from your producer processes. Simply specify the named pipes on the ybload command line in the same way that you would name regular files. Mixing pipe and non-pipe data sources for a load is not recommended; this could cause a deadlock with the program that the pipes communicate with.