Postgresql – How to import the _id index from a MongoDB collection as a hash index for a PostgreSQL table (from a CSV export from MongoDB)

backupdatabase-designmigrationmongodbpostgresql

I want to be able to keep track of the _id index values that get the MongoDB collection uses when I export the MongoDB collection (via CSV) and import the data into a PostgreSQL DB table (basically for archiving older data not frequently reported on from a MongoDB collection to a PostgreSQL DB table).

That said, if I ever wanted to add the records back into MongoDB for any reason, I want to ensure that the _id MongoDB collection values stay aligned with other collections in MongoDB and tables in PostgreSQL.

MongoDB uses a BSON object 24 character hash for its index (_id) and I want to be able to archive the data from MongoDBto a PostgreSQL DB using that _id value. It looks like it's possible to set a PostgreSQL table column datatype to character(24) and set it to the primary key as a hash using CREATE INDEX idx_hash ON 'pg_table_name_goes_here' USING HASH (_id); rather than requiring it to be a BSON like I originally though I would have to do, but will that work in a typical PostgreSQL table JOIN? and is there a better way to do this?

To put it another way, I'm taking a fair amount of data (about 5 to 8+ Million records a month) that is originating from a MongoDB and trying to archive previous/older data into PostgreSQL. So I want to keep that archived data in PostgreSQL exactly (or as close as possible) the same, in the event I might want to move the archived PostgreSQL back into some kind of NoSQL datastore in the future. In the meantime, I want to make querying the data in PostgreSQL as fast as possible. I want identical index columns / values.

Is this possible (using PostgreSQL v12 and MongoDB v4.4)?

I've been going down some pretty deep rabbit holes and finding a lot of weird and bizarre things about PostgreSQL. I've traditionally used the MySQL family of Relational DBs and PostgreSQL has some great features but it's been a little more awkward to get the hang of than I expected. I stumbled across this video series, and it makes sense: idx and seems to fit my use case as well as line up with your above comment. FWIW I did find this: pgbson but the complexity for deployments doesn't seem worth it.

Best Answer

MongoDB uses a BSON object 24 character hash for its index (_id)

Values are stored in columns in a relational database, not "in an index". To store such a character (aka "string") value, create a column of type varchar(24) to hold the _id value from MongoDB.

If you want to ensure uniqueness, then create a unique B-Tree index on the column or defined it as the primary key.

Something along the lines:

create table mongo_data
(
  id varchar(24) primary key, 
  ... other columns ...
);

The fact that the value is computed by a "hash algorithm" in MongoDB has nothing to do with a "hash index" in Postgres.