Getting multiValued data into Solr via CSV:
The Solr documentation describes a "split" option in UpdateCSV. Essentially, it parses a field value using a second CSV parser. See Solr - UpdateCSV - split. The parameters look like this (adjust the field name, separator, and encapsulator as necessary):
f.fieldA.split=true&f.fieldA.separator=%2C&f.fieldA.encapsulator='
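As a concrete illustration (field and file contents here are hypothetical), a CSV row like this, posted with the split parameters above:

```
id,fieldA
1,'red,green'
```

would be re-parsed by the second CSV parser, yielding two values in the multiValued fieldA for document 1: "red" and "green".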
Getting multiValued data from separate fields to maintain position:
Since asking this question, I've done some reading about dimensional models. It seems that what I was trying to do is poor design, because it places too many expectations on the application, too much complexity in the warehouse, or both.
When trying to preserve the relationships between two field values on a single record, it's better to store them separately as well as together. Here's a comparison of my former input to the new input:
Former CSV input:
name|licenseState|licenseType
Josh|MA,CA|123,456
Fred|MD,OH|789,123
Transformed CSV input:
name|licenseState|licenseType|licenseStateType
Josh|MA,CA|123,456|MA123,CA456
Fred|MD,OH|789,123|MD789,OH123
This way your application can use the licenseState and licenseType dimension values independently, or it can use the licenseStateType dimension values, all without requiring complicated app or warehouse logic.
Since you are indicating insert performance is the primary concern, I'd take the recommendation of the DBA and make the clustering key the identity column since it is a unique, monotonically ascending number, which is guaranteed to (almost) never cause page-splits on the table.
Also, don't store the GUID in an nvarchar(100) column; use the data type designed for it, which is uniqueidentifier.
Look carefully at the reporting requirements prior to adding indexes to this table. You may find you don't need an index on the GUID column at all. If you determine that an index on the GUID column is required, you may find it desirable to have the leading column of the index not be the GUID; perhaps it should be the Group or TestID column instead.
If you've determined, through a rigorous design process, that the majority of reporting queries against this table will use the session ID as the primary component of joins or WHERE clauses, you may actually benefit from clustering the table on the session ID. You'd likely want to set the page fill factor somewhere below 100%; perhaps start at 93% and monitor fragmentation levels. Also, it's worth pointing out that fragmentation may not be such a massive problem if you store this table on SSD.

Will the session ID be propagated throughout the other related tables? If so, you'll likely need non-clustered indexes on the column in those other tables as well, compounding the fragmentation issue. It may be worthwhile maintaining a single table where you insert the GUIDs to obtain a surrogate integer key to use in the other tables, something like:
IF OBJECT_ID('dbo.SessionSurrogate') IS NOT NULL
    DROP TABLE dbo.SessionSurrogate;

CREATE TABLE dbo.SessionSurrogate
(
    SessionID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_SessionSurrogate
        PRIMARY KEY NONCLUSTERED
    , SessionGUID uniqueidentifier NOT NULL
);

-- Cluster on the GUID so fragmentation is confined to this narrow table
CREATE UNIQUE CLUSTERED INDEX CI_SessionSurrogate --could be non-unique if required
    ON dbo.SessionSurrogate (SessionGUID);
IF OBJECT_ID('dbo.UserTestGroups') IS NOT NULL
DROP TABLE dbo.UserTestGroups;
CREATE TABLE dbo.UserTestGroups
(
    UserTestGroupsID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_UserTestGroups
        PRIMARY KEY CLUSTERED
    , SessionID int NOT NULL
    , TestID bigint NOT NULL
    , [Group] tinyint NOT NULL
    , InsertDate datetime NOT NULL
);
GO
CREATE PROCEDURE dbo.InsertSession
(
    @SessionGUID uniqueidentifier
    , @TestID bigint
    , @Group tinyint
)
AS
BEGIN
    -- Insert the GUID; the first OUTPUT routes the generated surrogate key
    -- into dbo.UserTestGroups, the second returns it to the caller
    INSERT INTO dbo.SessionSurrogate (SessionGUID)
    OUTPUT INSERTED.SessionID, @TestID, @Group, GETDATE()
        INTO dbo.UserTestGroups (SessionID, TestID, [Group], InsertDate)
    OUTPUT INSERTED.SessionID
    VALUES (@SessionGUID);
END;
GO
You'd then insert data like:
DECLARE @SessGUID uniqueidentifier = NEWID();
DECLARE @TestID bigint = 42;
DECLARE @Group tinyint = 6;
EXEC dbo.InsertSession @SessGUID, @TestID, @Group;
The benefit here is that you're only causing fragmentation on a single table with a very narrow row width (20 bytes: a 4-byte int plus a 16-byte uniqueidentifier), giving somewhere around 400 rows per page. The stored procedure above is very compact and outputs the generated int for you to use later, if necessary.
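If you need that generated key on the calling side, one way (a sketch; the table-variable name is illustrative) is to capture the procedure's result set with INSERT ... EXEC:

```sql
-- Capture the SessionID returned by the procedure's OUTPUT clause
DECLARE @NewSession TABLE (SessionID int NOT NULL);

DECLARE @SessGUID uniqueidentifier = NEWID();

INSERT INTO @NewSession (SessionID)
EXEC dbo.InsertSession @SessGUID, 42, 6;

SELECT SessionID FROM @NewSession;
```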
Best Answer
There are a few ways to work around this problem:
As you mention, partitioning will help. If you partition on the leading column in the index and make sure all partitions are equally sized, you can gain significant scale. However, the total CPU required to build the index won't be much lower if you partition; what it will save you is memory (if you have a slow I/O system, this can be useful).
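A minimal sketch of that approach (object names and boundary values are hypothetical; real boundaries would come from your data distribution):

```sql
-- Partition on the leading index column so each partition is sorted independently
CREATE PARTITION FUNCTION pf_ByKey (int)
    AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000);

CREATE PARTITION SCHEME ps_ByKey
    AS PARTITION pf_ByKey ALL TO ([PRIMARY]);

-- Building the index on the scheme sorts one partition at a time,
-- reducing the memory needed for any single sort
CREATE INDEX IX_BigTable_Key
    ON dbo.BigTable (KeyCol)
    ON ps_ByKey (KeyCol);
```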
If this is a traditional data warehouse fact table, you might be better off with a columnstore index
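For the fact-table case, that might look like the following (table name is hypothetical; clustered columnstore requires SQL Server 2014 or later):

```sql
-- Replaces the rowstore clustered index; data is stored compressed by column,
-- which typically suits append-mostly warehouse fact tables
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
    ON dbo.FactSales;
```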
Instead of doing one mega update, which is slow, consider a pattern that rewrites only the changed rows. This way, you can keep the benefit of the old table being sorted and avoid re-sorting all the rows (only the ones that have changed).
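The original answer's pattern isn't shown here; one common shape for it (a sketch, with hypothetical names) is to replace the update with a delete-plus-insert of just the changed rows, so the untouched, already-sorted rows are never rewritten:

```sql
-- Stage only the rows that changed
SELECT KeyCol, NewValue
INTO #Changed
FROM dbo.Staging;

BEGIN TRANSACTION;

-- Remove the old versions of just those rows...
DELETE t
FROM dbo.BigTable AS t
JOIN #Changed AS c ON c.KeyCol = t.KeyCol;

-- ...and insert the updated versions; only these rows need
-- re-sorting into the clustered index
INSERT INTO dbo.BigTable (KeyCol, Value)
SELECT KeyCol, NewValue
FROM #Changed;

COMMIT TRANSACTION;
```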