So, after the loop finishes, the output of db.collection.getShardDistribution() shows that all chunks are located on only one of the shards; a few minutes later, the chunks are in fact distributed evenly between the shards. Here is my question: shouldn't mongos distribute chunks between the shards during the execution of the loop, instead of directing them all to one shard?
The issue is that you are using a monotonically increasing shard key. This will result in all inserts targeting the single shard (a so-called "hot shard") that currently holds the chunk range representing the highest shard key values. The data will eventually be rebalanced to other shards, but this is not an effective use of sharding for write scaling. You need to choose a more appropriate shard key. If your use case does not require range queries, you could also consider using a hashed shard key on the address field. See Hashed vs Ranged Sharding for an illustration of the expected outcomes.
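As a sketch in the mongo shell (the database and collection names here are assumptions, not taken from your deployment), a hashed shard key is declared like this:

```javascript
// Enable sharding for the database (the name "test" is assumed here).
sh.enableSharding("test")

// Shard the collection on a hash of the address field instead of its raw value.
// Inserts with monotonically increasing address values are then spread across
// chunks (and shards) instead of all targeting the highest chunk range.
sh.shardCollection("test.collection", { address: "hashed" })
```

Note that this trades away efficient range queries on address, since consecutive values hash to unrelated chunks.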
Another related case: I try to import a large MongoDB database in BSON format using mongorestore. I'm doing it from outside of the Docker network: mongorestore --host 127.19.0.150:3300 -d import1 -c test /path/base.bson The import works well, but all the chunks end up on one of the shards.
If the outcome is similar (all inserts going to a single shard and then being rebalanced), this also suggests a poor shard key choice.
If you are bulk inserting into an empty sharded collection there is an approach you can use to minimize rebalancing: pre-splitting chunk ranges based on the known distribution of shard key values in your existing data.
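For example (the collection name matches your mongorestore command, but the shard key field and split points are placeholders you would derive from your actual data), pre-splitting in the mongo shell could look like this:

```javascript
// Shard the empty collection first (a ranged shard key on "myKey" is assumed).
sh.shardCollection("import1.test", { myKey: 1 })

// Pre-split the key space at known boundaries so the restore's inserts do not
// all land in a single initial chunk; pick split points that match the real
// distribution of shard key values in your BSON dump.
[1000, 2000, 3000].forEach(function (point) {
    sh.splitAt("import1.test", { myKey: point });
});
```

The balancer can then move the empty chunks to different shards before the import starts, so the bulk inserts are spread out from the beginning.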
The database schema consists of multiple fields. I have chosen one field with the Int32 datatype as the shard key, but its cardinality is very low: 15% of the documents have the same value for it. Could this be the source of the problem?
Low cardinality shard keys will definitely cause issues with data distribution. The most granular chunk range possible represents a single shard key value, so if a large percentage of your documents share the same shard key value, those documents will eventually end up in indivisible jumbo chunks, which are ignored by the balancer.
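A common mitigation (a sketch; the field names here are hypothetical) is to make the key more granular by compounding the low-cardinality field with a high-cardinality one, so chunks containing a popular value can still be split:

```javascript
// A compound shard key: documents sharing the same intField value can still
// be split into multiple chunks on the _id component, avoiding jumbo chunks.
sh.shardCollection("mydb.mycoll", { intField: 1, _id: 1 })
```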
All containers run on a computer with 32 GB of RAM and an i7-6700HQ. Could a slow HDD be a bottleneck resulting in such slow chunk migration?
There isn't enough information to determine whether your disk is the main limiting factor, but running a sharded cluster on a single computer with a slow HDD will certainly add resource contention challenges. Choosing appropriate shard keys should minimize the need for data migration unless you are adding or removing shards in your deployment.
Assuming you are using a recent version of MongoDB with the WiredTiger storage engine as the default (MongoDB 3.2+), you will definitely want to explicitly set --wiredTigerCacheSizeGB to limit the internal cache size for the mongod instances; otherwise each mongod on the machine will size its cache as if it had most of the available RAM to itself. See: To what size should I set the WiredTiger internal cache?
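For example, the equivalent setting in a mongod configuration file (the 1 GB value is only an illustration; size it according to how many mongod instances share the machine):

```yaml
# mongod.conf fragment: cap the WiredTiger internal cache at 1 GB per instance
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```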
No, forget the LIKE operator for this; conditions such as first_name LIKE '%name%' will never be efficient. A regular index cannot be used for a range scan with a condition like first_name LIKE '%name%', because the beginning of the value is unknown. If first_name is indexed, an index full scan is the best the database can do, and you still need to handle case-sensitivity. (An index range scan can be used for conditions such as first_name LIKE 'name%', where the beginning of the value is fixed.)
This is a task for Oracle Text.
No, forget virtual columns: you cannot create text indexes on virtual columns. This is what you get if you try:
DRG-11304: function-based indexes are not supported by this indextype
I have answered a similar question: Oracle “matches” multiple columns
But here it is again, with an example:
create table t1 (
    id number primary key,
    code number not null,
    first_name varchar2(100 char) not null,
    last_name varchar2(100 char) not null,
    email varchar2(50 char) not null
);
insert into t1 values (1, 111, 'Balazs', 'Papp', 'dba@stackexchange.com');
insert into t1 values (2, 222, 'linux', 'unil', 'index@stackexchange.com');
commit;
begin
    ctx_ddl.create_preference('t1_multi_column_datastore', 'multi_column_datastore');
    ctx_ddl.set_attribute('t1_multi_column_datastore', 'columns', 'code, first_name, last_name, email');
    ctx_ddl.create_preference('t1_lexer', 'basic_lexer');
    ctx_ddl.set_attribute('t1_lexer', 'mixed_case', 'no');
end;
/
create index i1 on t1 (first_name) indextype is ctxsys.context
parameters ('lexer t1_lexer datastore t1_multi_column_datastore');
Then the query:
SQL> select * from t1 where contains (first_name, '%STACK%') > 0;
ID CODE FIRST_NAME LAST_NAME EMAIL
---------- ---------- ---------- ---------- -------------------------
1 111 Balazs Papp dba@stackexchange.com
2 222 linux unil index@stackexchange.com
SQL> select * from t1 where contains (first_name, '%11%') > 0;
ID CODE FIRST_NAME LAST_NAME EMAIL
---------- ---------- ---------- ---------- -------------------------
1 111 Balazs Papp dba@stackexchange.com
SQL> select * from t1 where contains (first_name, '%Uni%') > 0;
ID CODE FIRST_NAME LAST_NAME EMAIL
---------- ---------- ---------- ---------- -------------------------
2 222 linux unil index@stackexchange.com
Notice how I did not specify all the columns, yet the query searched all of them (the ones enumerated in the multi_column_datastore). I specified first_name in the index and in the query, and the query still returned records that matched on the email, code or last_name columns.
Notice how I did not specify anything related to case sensitivity in the query. It was handled by the mixed_case=no lexer attribute (Oracle Text word searches are not case-sensitive by default: see Case-Sensitive Searching).
And the query uses the index:
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------
SQL_ID a4vw67psmh4x3, child number 0
-------------------------------------
select * from t1 where contains (first_name, '%Uni%') > 0
Plan hash value: 1218792127
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4 (100)| |
| 1 | TABLE ACCESS BY INDEX ROWID| T1 | 1 | 544 | 4 (0)| 00:00:01 |
|* 2 | DOMAIN INDEX | I1 | | | 4 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("CTXSYS"."CONTAINS"("FIRST_NAME",'%Uni%')>0)
Best Answer
MongoDB can take advantage of a compound index when you query on its leading column(s), so in your case the compound index on userId and eventDate can be used for a query on userId alone. You don't need a separate index.
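A sketch in the mongo shell (the collection name "events" is an assumption), using explain() to confirm the index prefix is used:

```javascript
// The existing compound index; userId is its prefix.
db.events.createIndex({ userId: 1, eventDate: 1 })

// A query on userId alone can use the compound index via its prefix;
// explain() should show an IXSCAN over { userId: 1, eventDate: 1 }.
db.events.find({ userId: 123 }).explain("queryPlanner")
```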