MongoDB – How to specify a custom id when uploading a stream to a GridFS file using the GridFSBucket API

mongodb-3.6 sharding

I'm using TypeScript, MongoDB 3.6 and version 3.0 of the mongodb driver. Here is how I upload a file:

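this.gfs = new GridFSBucket(this.media_db, {bucketName: 'original'});
const reader = fs.createReadStream(path);
// The first argument is the filename, not the file id.
const writer = this.gfs.openUploadStream(result.sha1);
reader.pipe(writer);
console.log("STORE ORIGINAL MongoDb/GridFs");
const waiter = new Promise((resolve, reject) => {
    // Wait for the write stream to finish: the reader's 'end' event can
    // fire before the last chunks have been written to GridFS.
    writer.on('finish', () => resolve(true));
    writer.on('error', reject);
    reader.on('error', reject);
});
await waiter;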

This setup has a serious problem. My earlier question about creating indexes on GridFS collections gives the background:

GridFS: why can't I shard chunks with a hashed key?

I got an answer that explains how these should be indexed: the file id should have a uniform distribution (e.g. a hashed key). The default id value is not like that, so I must provide my own file ids if I want the files to be distributed over multiple nodes from the beginning. The chunk ids are a different story: they are generated ObjectId values, and they should be left alone. Using generated ObjectId values for chunks allows MongoDB to perform a range query on the chunks, resulting in a much more efficient query plan (e.g. when a data stream needs to be reconstructed for a file).

The question is this: using the official API and the GridFSBucket class, how can I specify my own file id? That seems to be the right approach, but I don't see any way to do it. Here is the signature of the openUploadStream method:

openUploadStream(filename: string, options?: GridFSBucketOpenUploadStreamOptions): GridFSBucketWriteStream;

It only has a filename parameter and an options parameter. The options interface looks like this:

export interface GridFSBucketOpenUploadStreamOptions {
    chunkSizeBytes?: number,
    metadata?: Object,
    contentType?: string,
    aliases?: Array<string>
}

If this can't be done, then there is no way to avoid hot shards when mass uploading files to GridFS. (In that case I probably need to submit an issue.)

Best Answer

The solution is the openUploadStreamWithId method. It has this signature:

openUploadStreamWithId(id: GridFSBucketWriteStreamId, filename: string, options?: GridFSBucketOpenUploadStreamOptions): GridFSBucketWriteStream;

The GridFSBucketWriteStreamId type is like this:

type GridFSBucketWriteStreamId = string | number | Object | ObjectID;

So you can call it like this:

openUploadStreamWithId(your_custom_file_id, your_file_name)

If the custom file id is uniformly distributed and the chunks collection is sharded on (files_id, n), where n is the chunk's sequence number within its file, then this will prevent hot shards and still allow range queries on the chunks.
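As a rough sketch of how that call might be wrapped, the snippet below uploads a stream under a caller-chosen id and resolves once the write stream emits 'finish'. The names BucketLike, sha1Hex and uploadWithId are hypothetical and not part of the driver. BucketLike is a minimal structural type declaring only the one method used, on the assumption that a GridFSBucket with the typings quoted above satisfies it. Using a hex SHA-1 digest as the id is likewise just one assumed way of getting roughly uniform ids.

```typescript
import { createHash } from "crypto";
import { Readable, Writable } from "stream";

// Hypothetical structural type: only the method this sketch needs.
// Assumption: a GridFSBucket typed as quoted above is assignable to it.
interface BucketLike {
    openUploadStreamWithId(id: string, filename: string): Writable;
}

// Hypothetical helper: a hex SHA-1 digest is close to uniformly
// distributed, so it can serve as a custom file id.
function sha1Hex(data: Buffer | string): string {
    return createHash("sha1").update(data).digest("hex");
}

// Hypothetical helper: upload `source` under the given id and resolve with
// that id once the write stream reports 'finish'.
function uploadWithId(
    bucket: BucketLike,
    id: string,
    filename: string,
    source: Readable
): Promise<string> {
    return new Promise((resolve, reject) => {
        const writer = bucket.openUploadStreamWithId(id, filename);
        source.on("error", reject);
        writer.on("error", reject);
        writer.on("finish", () => resolve(id));
        source.pipe(writer);
    });
}
```

With a real bucket this might be called as `await uploadWithId(this.gfs, result.sha1, result.sha1, fs.createReadStream(path))`, assuming result.sha1 is a hex digest string as in the question.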