SQLite: choosing a unique separator

sqlite

I'm working on a project that will result in an SQLite database of about 6 GB of text content (encoded UTF-8). The text will be diverse: it will have a great deal of plain text writing, but also a significant amount of special characters (tildes, backticks, section symbols, mdash's, endash's etc.). There will also be math formulas.

It looks like .import will be the load method.

Question: What could I use as an .separator value that won't be in my text?

I've grepped the future text with a few ideas and have not identified a separator that isn't in the actual content.

I suppose I could escape whatever separators may be in the text. But I prefer to avoid that option if I can.

Best Answer

I was just about to suggest using a multi-character separator or a character from a foreign language (e.g. ) but .separator does not allow multi-character strings or even a multi-byte characters.

I would use a multi-character separator with very low probability to be in the text to import and then parse the document using a custom script. Here is an example in Python3 using $$$$$$ as a separator, but consider strings like [ŠĐć~^˘°˛˙€] if needed.

import.csv

hello $$$$$$ world $$$$$$ 1
foo   $$$$$$ bar   $$$$$$ 2

sqliteimport.py

import sqlite3

import_file_name = "import.csv"
cell_separator = "$$$$$$"
lines = []
insert_query = "INSERT INTO imported (a, b, c) VALUES (?, ?, ?);"
db_connection = sqlite3.connect("database.sqlite")
db_cursor = db_connection.cursor()
db_cursor.execute("CREATE TABLE imported (a TEXT, b TEXT, c INTEGER);")
db_cursor.execute("BEGIN TRANSACTION")
with open(import_file_name, 'r') as import_file:
    for line in import_file:
        # Split each line at each cell_separator into a list of strings.
        # Strip each one of the columns to remove whitespaces.
        cleaned_columns = [column.strip()
                           for column in line.split(cell_separator)]
        db_cursor.execute(insert_query, tuple(cleaned_columns))
db_connection.commit()
db_cursor.execute("VACUUM;")
db_cursor.close()
db_connection.close()
Related Question