I'm using TypeScript, MongoDB 3.6 and the mongodb 3.0 driver. Here is how I upload a file to GridFS:
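this.gfs = new GridFSBucket(this.media_db, {bucketName: 'original'});
const reader = fs.createReadStream(path);
// result.sha1 is passed as the filename; the file id is still auto-generated here.
const writer = this.gfs.openUploadStream(result.sha1);
reader.pipe(writer);
console.log("STORE ORIGINAL MongoDb/GridFs");
// Wait for the upload stream to finish, not just for the reader to end,
// so that all chunks have been written before continuing.
const waiter = new Promise((resolve, reject) => {
    writer.on('finish', () => resolve(true));
    writer.on('error', reject);
    reader.on('error', reject);
});
await waiter;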
There is a big problem with this setup. Here is my former question about creating indexes on gridfs collections:
GridFS: why can't I shard chunks with a hashed key?
I got an answer that explains how these should be indexed: the file id should have a uniform distribution (e.g. a hashed key). The default id value is not like that, so I must provide my own file ids if I want the files to be distributed over multiple nodes from the beginning. The chunk ids are a different story: they are generated ObjectId values, and they should be left alone. Using generated ObjectId values for chunks allows MongoDB to perform a range query on the chunks, resulting in a much more efficient query plan (e.g. when a data stream needs to be reconstructed for a file).
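To illustrate what a uniformly distributed file id could look like, here is a minimal sketch that derives the id from the file content with SHA-1. The helper name contentFileId is hypothetical; the upload code above already has such a digest available as result.sha1. Unlike an ObjectId, whose leading bytes are a timestamp, a hash digest has no ordering between consecutive uploads.

```typescript
import { createHash } from "crypto";

// Hypothetical helper (not part of the mongodb driver): derive a file id
// from the file's content. The hex digest of a cryptographic hash is
// effectively uniformly distributed, so consecutive uploads do not land
// in the same key range.
export function contentFileId(data: Buffer | string): string {
  return createHash("sha1").update(data).digest("hex");
}
```

The digest is a 40-character hex string, which can be used as a string file id.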
The question is this: using the official API and the GridFSBucket class, how can I specify my own file id? That is how it should be done, but I don't see any way to do it. Here is the signature of the openUploadStream method:
openUploadStream(filename: string, options?: GridFSBucketOpenUploadStreamOptions): GridFSBucketWriteStream;
It only has a filename parameter and an options parameter. The options look like this:
export interface GridFSBucketOpenUploadStreamOptions {
    chunkSizeBytes?: number,
    metadata?: Object,
    contentType?: string,
    aliases?: Array<string>
}
If this can't be done, then there is no way of avoiding hot shards when mass uploading files to GridFS. (Then I probably need to submit an issue.)
Best Answer
The solution is the openUploadStreamWithId method. It has this signature:
The GridFSBucketWriteStreamId type is like this:
So you can call it like this:
openUploadStreamWithId(your_custom_file_id, your_file_name)
If the custom file id is uniformly distributed and the chunks are sharded on (files_id, chunk_id), then this prevents hot shards and still allows range queries on the chunks.
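A fuller sketch of the call, under some assumptions: the UploadBucket interface below is a minimal structural stand-in for GridFSBucket that declares only the one method used, so the snippet compiles without the driver, and uploadWithId is a hypothetical helper name. As far as I recall, the driver's typings declare the method with the id first, then the filename, then the same optional options object as openUploadStream, but check the typings of your driver version.

```typescript
import { Readable, Writable } from "stream";

// Assumed minimal shape of the bucket: only the method used here.
// A real GridFSBucket instance should be passable wherever this is expected.
export interface UploadBucket {
  openUploadStreamWithId(id: string, filename: string): Writable;
}

// Hypothetical helper: upload `source` under a caller-chosen file id and
// resolve with that id once the upload stream has flushed everything.
export function uploadWithId(
  bucket: UploadBucket,
  id: string,
  filename: string,
  source: Readable
): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const writer = bucket.openUploadStreamWithId(id, filename);
    source.on("error", reject);
    writer.on("error", reject);
    // 'finish' fires on the write side, after the last chunk was written.
    writer.on("finish", () => resolve(id));
    source.pipe(writer);
  });
}
```

With the code from the question, the call might look like uploadWithId(this.gfs, result.sha1, result.sha1, fs.createReadStream(path)), using the SHA-1 digest as both the file id and the filename.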