Move Commonly-Repeated Fields into a new Table

duplication, foreign key, normalization

I'm still at the design stage for a new system, and I've noticed there are several places where different types of object share similar fields.

In this case, we have customers, offices, warehouses etc. that have address fields like Address1, Address2, ... Postcode. So, I was strongly considering making a new Addresses table with all these fields in, and an ID, and linking to that with foreign keys for all of the different types that contain addresses. The same thing could be done for groups of telephone/fax numbers too.

To illustrate this point, here's the current situation in an abstract form…

Customers        Offices
- CustomerID     - OfficeID
- FirstName      - OfficeName
- Surname        - Telephone
- Address1       - Address1
- Address2       - Address2
- Address3       - Address3
- Address4       - Address4
- Postcode       - Postcode

…and after the change…

Customers        Offices         Addresses
- CustomerID     - OfficeID      - AddressID
- FirstName      - OfficeName    - Address1
- Surname        - Telephone     - Address2
- AddressID      - AddressID     - Address3
                                 - Address4
                                 - Postcode

The objective here isn't really normalization (I'm not intending to eliminate duplicate addresses here, even though it may be possible) but to help keep similar subtypes of data consistent. It also means that in the model code, I can treat all addresses in the same way with one class, whatever they belong to. I hope that the overhead in joining these tables wouldn't be that high, especially since I don't always need the addresses for all use cases.
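As a rough sketch of the "after" layout and the single address class, here is how it might look using Python's built-in sqlite3 module. The column types, the NOT NULL choices, the Address class, the load_address helper and the sample rows are all assumptions made up for illustration; only the table and column names come from the diagrams above.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Address:
    """One model class for every address, whatever it belongs to."""
    address_id: int
    address1: str
    address2: Optional[str]
    address3: Optional[str]
    address4: Optional[str]
    postcode: str


# Column types and constraints are assumptions; the names follow the diagram.
SCHEMA = """
CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    Address1  TEXT NOT NULL,
    Address2  TEXT,
    Address3  TEXT,
    Address4  TEXT,
    Postcode  TEXT NOT NULL
);
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    FirstName  TEXT NOT NULL,
    Surname    TEXT NOT NULL,
    AddressID  INTEGER REFERENCES Addresses(AddressID)
);
CREATE TABLE Offices (
    OfficeID   INTEGER PRIMARY KEY,
    OfficeName TEXT NOT NULL,
    Telephone  TEXT,
    AddressID  INTEGER REFERENCES Addresses(AddressID)
);
"""


def load_address(conn: sqlite3.Connection, address_id: int) -> Optional[Address]:
    """Hypothetical helper: fetch one Addresses row as an Address object."""
    row = conn.execute(
        "SELECT AddressID, Address1, Address2, Address3, Address4, Postcode "
        "FROM Addresses WHERE AddressID = ?",
        (address_id,),
    ).fetchone()
    return Address(*row) if row else None


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(SCHEMA)

# Made-up sample data.
conn.execute("INSERT INTO Addresses VALUES (1, '1 High Street', NULL, NULL, 'Leeds', 'LS1 1AA')")
conn.execute("INSERT INTO Addresses VALUES (2, 'Unit 5', 'Dock Road', NULL, 'Hull', 'HU1 1AA')")
conn.execute("INSERT INTO Customers VALUES (1, 'Ann', 'Smith', 1)")
conn.execute("INSERT INTO Offices VALUES (1, 'Head Office', '01234 567890', 2)")

# The same class and the same loader serve both owner types.
customer_address_id = conn.execute(
    "SELECT AddressID FROM Customers WHERE CustomerID = 1").fetchone()[0]
office_address_id = conn.execute(
    "SELECT AddressID FROM Offices WHERE OfficeID = 1").fetchone()[0]
customer_address = load_address(conn, customer_address_id)
office_address = load_address(conn, office_address_id)
print(customer_address)
print(office_address)
```

Code that only needs names or telephone numbers never touches Addresses at all, which is the use case I'm hoping keeps the join cost low.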

Your opinions/experiences please!

Best Answer

Yes, move it. The extra join overhead will be minimal (if the right indexes are in place!) and you will get a performance improvement for all the other queries which don't use addresses because you will get more rows per page in Customers and Offices.
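To make the point about joins and indexes concrete, here is a minimal sketch using Python's sqlite3 module. The trimmed-down columns, the index name idx_customers_address and the sample rows are assumptions for illustration, and other database engines will plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    Address1  TEXT,
    Postcode  TEXT
);
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    FirstName  TEXT,
    Surname    TEXT,
    AddressID  INTEGER REFERENCES Addresses(AddressID)
);
-- Assumed index on the foreign-key column: it supports lookups from an
-- address back to the rows that reference it.
CREATE INDEX idx_customers_address ON Customers(AddressID);

INSERT INTO Addresses VALUES (1, '1 High Street', 'LS1 1AA');
INSERT INTO Customers VALUES (1, 'Ann', 'Smith', 1);
""")

# A query that does not need addresses reads only the (now narrower) Customers table.
names = conn.execute(
    "SELECT FirstName, Surname FROM Customers ORDER BY CustomerID").fetchall()

# A query that does need the address joins on the Addresses primary key.
JOIN_SQL = """
    SELECT c.Surname, a.Address1, a.Postcode
    FROM Customers AS c
    JOIN Addresses AS a ON a.AddressID = c.AddressID
    WHERE c.CustomerID = ?
"""
with_address = conn.execute(JOIN_SQL, (1,)).fetchone()

# The reverse lookup (who uses this address?) is where the FK index helps.
owners = conn.execute(
    "SELECT CustomerID FROM Customers WHERE AddressID = ?", (1,)).fetchall()

# Print the plan SQLite chooses for the join; the wording varies by version.
for step in conn.execute("EXPLAIN QUERY PLAN " + JOIN_SQL, (1,)):
    print(step)
```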

Down the road, any master data management (MDM) or data clean-up will be easier with the values all in one place.

Sticky points to look out for are common definitions of column semantics (is Address4 the city, say, for absolutely every possible address?). This could be important for reporting, for example, or summarising sales by region. Uniqueness could be difficult to define for any natural keys you may choose to create on Addresses.

Related Solutions

Loading multiValued fields into Solr via flat file, and possibly value position preservation in those fields

Getting multiValued data into Solr via CSV:

The solr documentation describes a "split" function in UpdateCSV. Essentially, it parses a field value using a second CSV parser. See Solr - UpdateCSV - split. The parameters look like so (adjust field name, separator, and encapsulator as necessary):

f.fieldA.split=true&f.fieldA.separator=%2C&f.fieldA.encapsulator='
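As a small illustration, the parameter string above can be built programmatically, and the effect of running a field value through a second CSV parser can be imitated locally with Python's csv module. The helper names split_params and split_value are hypothetical, and the local parse only approximates the idea; it is not Solr's own parser.

```python
import csv
from urllib.parse import urlencode


def split_params(field, separator=",", encapsulator="'"):
    """Hypothetical helper: build the per-field split parameters shown above."""
    return {
        f"f.{field}.split": "true",
        f"f.{field}.separator": separator,
        f"f.{field}.encapsulator": encapsulator,
    }


# urlencode percent-encodes the comma as %2C and the quote as %27,
# which is the URL-encoded form of the literal ' in the example above.
query = urlencode(split_params("fieldA"))
print(query)


def split_value(raw, separator=",", encapsulator="'"):
    """Parse one field value with a second CSV parser (local imitation only)."""
    return next(csv.reader([raw], delimiter=separator, quotechar=encapsulator))


# The encapsulator lets a single value contain the separator itself.
values = split_value("MA,'New York, NY',CA")
print(values)
```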

Getting multiValued data from separate fields to maintain position:

Since asking this question, I've done some reading about dimensional models. It seems that what I was trying to do is poor design, because it places too many expectations on the application, too much complexity in the warehouse, or both.

When trying to preserve the relationships between two field values on a single record, it's better to store them separately as well as together. Here's a comparison of my former input to the new input:

Former CSV input:

name|licenseState|licenseType
Josh|MA,CA|123,456
Fred|MD,OH|789,123

Transformed CSV input:

name|licenseState|licenseType|licenseStateType
Josh|MA,CA|123,456|MA123,CA456
Fred|MD,OH|789,123|MD789,OH123

This way your application can use the licenseState and licenseType dimension values independently, or it can use the licenseStateType dimension values, all without requiring complicated app or warehouse logic.
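One way to produce the transformed file from the former one is sketched below with Python's csv module. The function name add_state_type is made up for this example, and it assumes the two multi-valued columns always line up position by position, raising an error when they don't.

```python
import csv
import io

FORMER = """name|licenseState|licenseType
Josh|MA,CA|123,456
Fred|MD,OH|789,123
"""


def add_state_type(src, dst):
    """Copy the pipe-delimited input, appending a licenseStateType column."""
    reader = csv.DictReader(src, delimiter="|")
    fields = list(reader.fieldnames) + ["licenseStateType"]
    writer = csv.DictWriter(dst, fieldnames=fields, delimiter="|",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        states = row["licenseState"].split(",")
        types = row["licenseType"].split(",")
        if len(states) != len(types):
            # Positions carry the relationship, so a mismatch cannot be paired.
            raise ValueError(f"mismatched license values for {row['name']}")
        # Pair values by position: MA + 123 -> MA123, CA + 456 -> CA456.
        row["licenseStateType"] = ",".join(s + t for s, t in zip(states, types))
        writer.writerow(row)


out = io.StringIO()
add_state_type(io.StringIO(FORMER), out)
transformed = out.getvalue()
print(transformed)
```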

Database Schema – Creating a Table with Many Optional Fields

Your second option is more flexible, but I'm not sure why you're worried about "tons more tables". Usually this would be done with a single contact_details table, plus a lookup table for the types and a link table to organizations:

contact_types
-------------
  id (PK)
  name

contact_details
---------------
  id (PK)
  contact_type_id (FK to contact_types.id)
  value

organization_contacts
---------------------
  id (PK)
  contact_detail_id (FK to contact_details)
  organization_id (FK to organizations)

populated like this:

contact_types
-------------
id  | name
----+-----------
1   | web_url
2   | facebook_url
3   | phone_1

contact_details
---------------
id  | contact_type_id  | value
----+------------------+-------
1   | 1                | www.stuff.com
2   | 3                | (111) 111-1111
3   | 2                | facebook.com/?profileid=stuff

organization_contacts
---------------------
id | contact_detail_id  | organization_id
---+--------------------+----------------
1  | 1                  | 1
2  | 2                  | 1
3  | 3                  | 1

This schema only has 3 tables (not "tons") and you can have as many contact types as you want. The joins aren't that complicated. Your contact_details table will be large as there is 1 record per piece of contact info, but unless you have "tons" of contact info for each organization and "tons" of organizations, this probably won't be too big a problem. ;)
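To show how the joins look in practice, here is a minimal sketch of the three tables in SQLite via Python's sqlite3 module, loaded with the sample rows above. The organizations table, its name column, the column types and the contacts_for helper are assumptions added so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
-- Assumed minimal parent table; the original only references it.
CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE contact_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE contact_details (
    id              INTEGER PRIMARY KEY,
    contact_type_id INTEGER NOT NULL REFERENCES contact_types(id),
    value           TEXT NOT NULL
);
CREATE TABLE organization_contacts (
    id                INTEGER PRIMARY KEY,
    contact_detail_id INTEGER NOT NULL REFERENCES contact_details(id),
    organization_id   INTEGER NOT NULL REFERENCES organizations(id)
);

INSERT INTO organizations VALUES (1, 'Example Org');
INSERT INTO contact_types VALUES (1, 'web_url'), (2, 'facebook_url'), (3, 'phone_1');
INSERT INTO contact_details VALUES
    (1, 1, 'www.stuff.com'),
    (2, 3, '(111) 111-1111'),
    (3, 2, 'facebook.com/?profileid=stuff');
INSERT INTO organization_contacts VALUES (1, 1, 1), (2, 2, 1), (3, 3, 1);
""")


def contacts_for(conn, organization_id):
    """Hypothetical helper: list (type name, value) pairs for one organization."""
    return conn.execute(
        """
        SELECT ct.name, cd.value
        FROM organization_contacts AS oc
        JOIN contact_details AS cd ON cd.id = oc.contact_detail_id
        JOIN contact_types   AS ct ON ct.id = cd.contact_type_id
        WHERE oc.organization_id = ?
        ORDER BY oc.id
        """,
        (organization_id,),
    ).fetchall()


contacts = contacts_for(conn, 1)
for type_name, value in contacts:
    print(type_name, value)
```

Adding a new kind of contact is then a single INSERT into contact_types, with no change to the table definitions.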

You could also have a contact_details table which stores all of the fields. Something like this:

contact_details
---------------
  id (PK)
  main_email
  secondary_email
  web_url
  facebook_url
  linkedin_url
  myspace_url
  street_address_line_1
  street_address_line_2
  street_address_line_3
  city
  prov_state
  country
  postal_code
  phone_num_1
  phone_num_2
  fax_num

This structure is much simpler but is more static. If you don't plan to change your set of contact data very much, and you think that most records will have most (or above a certain threshold of) fields filled out, I suspect this would perform better.
