Loading multiValued fields into Solr via flat file, and possibly value position preservation in those fields

A hobby web project of mine uses data from many sources, some flat, some highly normalized. My ETL process boils all sources down to one logical table (which is soon to include some multi-valued fields). The final table is loaded into a few separate data presentation environments (Solr or MySQL based). My Solr environment demands a denormalized model (it just loads the transformed output table directly as a flat file), and my MySQL environments can each use a highly normalized model (loading involves some normalization). Both presentation environment load processes are very simple right now.

To facilitate the use of flat files throughout my ETL process (important because I use a variety of tools to summarize and cleanse the data, and make heavy use of pipelines for performance reasons), I've been able to reduce multi-valued fields from my sources to a single field using grouping or best candidate selection.

Now I'm designing for the multi-valued fields I need to preserve in my output. The MySQL load process is easy because I've already normalized it with these fields in mind. But I'm having trouble with the Solr part:

I have zero experience with "multiValued" fields in Solr
the flat file Load process is currently very simple, I'd like to preserve as much simplicity as possible
my multiValued fields must be coordinated by position within a record, e.g.
- record X, field A, position 3 is a license #
- record X, field B, position 3 is the corresponding state for that license

How might I design my ETL for the multi-valued fields I plan to preserve?

Some ideas I've come up with that I'm hitting roadblocks on:

introduce an "additive" record using the same document key that occupies its own row in the flat file
- no idea where to start on this one
pre-assemble values into a single field, parse using Solr, e.g.
- {[[123],[456]];[[MA],[CA]]} -> "123,MA;456,CA"
- not sure if Solr can unpack a string and preserve position info appropriately across fields

I've been unable to find information about coordinating multiValued field values from separate fields. Allowing for the fact that this might not be natively supported by Solr, I'm still interested in solutions to loading multiValued fields into Solr in a way that's nearly as simple as loading a flat file.

Best Answer

Getting multiValued data into Solr via CSV:

The solr documentation describes a "split" function in UpdateCSV. Essentially, it parses a field value using a second CSV parser. See Solr - UpdateCSV - split. The parameters look like so (adjust field name, separator, and encapsulator as necessary):

f.fieldA.split=true&f.fieldA.separator=%2C&f.fieldA.encapsulator='

Getting multiValued data from separate fields to maintain position:

Since asking this question, I've done some reading about dimensional models. It seems that what I was trying to do is poor design, because it places too many expectations on the application, to much complexity in the warehouse, or both.

When trying to preserve the relationships between two field values on a single record, it's better to store them separately as well as together. Here's a comparison of my former input to the new input:

Former CSV input:

name|licenseState|licenseType
Josh|MA,CA|123,456
Fred|MD,OH|789,123

Transformed CSV input:

name|licenseState|licenseType|licenseStateType
Josh|MA,CA|123,456|MA123,CA456
Fred|MD,OH|789,123|MD789,OH123

This way your application can use the licenseState and licenseType dimension values independently, or it can use the licenseStateType dimension values, all without requiring complicated app or warehouse logic.

Best Answer

Related Solutions

ETL Best Practices – Data Transfer Between OLTP Systems

Sql-server – What options exist for copying data between servers on a regular basis

Related Question