Sql-server – Most performant SQL query needed

sql server

I'm storing documents with fields that are set in a DataSource table. The document is written to the database as follows:

Document table (ID, DataSourceID)
Document fields (DocumentID, DataSourceField, FieldValue)

For example:

Document
100 | "Invoice"
101 | "Receipt"

Fields
100 | "Date" | 2011/12/01
100 | "Amount" | 1200,00
101 | "Date" | 2011/12/02
101 | "Warehouse" | "Central"

The documents have many more fields. Now, the question is how to write the fastest query to find documents by entering some filter values on fields (ie. Date between '2011/12/1' and '2011/12/31' and Amount > 1000,00)?

This is what I came up:

select * from Document, Fields 
where DocumentID = FieldDocumentID and DataSource = "Invoice"
and DocumentID in (select FieldDocumentID from Fields where Field = "Date"
and FieldValue between '2011/12/01' and '2011/12/31')
and DocumentID in (select FieldDocumentID from Fields where Field = "Amount"
and FieldValue >= 1000)

I'm not sure how well this is going to work with millions of rows. How could I do this better?

Best Answer

EAV based systems (e.g. Agresso) have query performance problems inherent to the EAV structure. The problems are really based on three underlying issues:

  • The structure is fiddly to query, so you get added complexity in your queries.

  • You can't index the base table on any of the attributes, which drives some inherent inefficiency in the query plans.

  • You have to do multiple joins against a large table.

Some possible strategies for mitigating EAV performance problems are:

  • Using a clustered index to keep all records relating to a given parent in close physical proximity can help somewhat. This minimises the I/O.

  • Consider using the metadata to create a view that flattens out the EAV structure. You may have to redefine the view every time it changes, and it won't necessarily help performance much, but it will be much easier to query. Try to create the view as a roll-up rather than a multiple way join.

  • Any attribute that is not strictly user defined (eg. transaction dates, amounts) should go on the base table. These can also be indexed.

  • Resolve the most selective predicates in subqueries first and join against the sub-query.

  • If you have the option flatten out the physical structure, even if it has to be into a series of fields like 'StringField1', 'StringField2', 'DateField1', 'MoneyField1'. Your application will have to hold the configration data anyway, so you can just map to a field on the database.

    This sounds clumsy, but it's much more efficient to query, and it doesn't have to be in the users' faces. You can also use the metadata to generate a view over the table that names the fields, so other customers of the data can see a meaningful view of the data and query it without necessarily having access to the application metadata.

    If you run out of columns in the table, just add more in the database. The field mapping system can always query the system data dictionary to see what's available.

At some point in these debates, somebody often chips in with a comment about XML columns ("I know, I'll just use XML"). To paraphrase JWZ, now you have two problems. Around 1998 they installed another circle in hell just for people who defile their systems with XML blob fields.

Although it doesn't look slick and high-tech, just having a set of user attribute columns on the table is by far the best solution. It is the most efficient, and there are many ways to make this approach more user friendly at the database level.

EDIT: Here is a snippet of T-SQL showing how to create a view that flattens out an EAV structure. Note that you wound have to write a generator for the view based on the application's attribute metadata and re-create it every time this is changed.

if object_id ('dbo.Foo') is not null
    drop table dbo.Foo
go

if object_id ('dbo.FooView') is not null
    drop view dbo.FooView
go 

create table dbo.Foo (
       FooID     int not null
      ,FooKey    varchar (20)
      ,FooVal    varchar (100)
)

insert Foo (FooID, FooKey, FooVal)
values (1, 'Date', '2011-11-11')

insert Foo (FooID, FooKey, FooVal)
values (1, 'Amount', '100')

insert Foo (FooID, FooKey, FooVal)
values (2, 'Date', '2012-12-12')

insert Foo (FooID, FooKey, FooVal)
values (2, 'Amount', '200')
go

create view dbo.FooView
    as
select *
  from (select FooID
              ,FooKey
              ,FooVal
          from dbo.Foo) keys
 pivot (Max (FooVal) 
        for FooKey in ([Date],[Amount])) pvt
go

select * from dbo.FooView

This is much more efficient than solutions involving multiple self-joins on the table, and will be somewhat efficient with a clustered index on the base entity and attribute type keys. You will probably also want to put type conversions into the view so you can sort the results correctly.