SQL Server – Should SSIS packages and SQL database be on the same server?

architecture, sql-server, ssis

I'm a bit new to SSIS. I have a lot of 'dev' experience on SQL Server (building, querying, performance tuning) but I'm not a master of administration or overall performance modeling.

So here's my question: should SSIS packages be run on the same server as the data warehouse (the SQL Server databases)?

For one, I already get a cryptic error message that certain 'bulk insert' tasks in SSIS can only be run against local databases. Maybe this can be worked around; maybe it cannot, and the local path is simply much faster, which would argue for putting the ETL on the same box as the database.

Second, even if that can be worked around, wouldn't it be faster anyway to have the ETL on the same server as the database? Or, barring that, geographically as close as possible (i.e. not on different continents)?

In particular, the ETL jobs run overnight, and the database is not worked on overnight (aside from possible backup processes), so I don't foresee them tripping over each other much.

If bulk insert tasks can only be run against local databases, wouldn't the test/dev environment need to be set up on the DB server as well? Just wondering what is sensible. It's a smaller shop here, but we stretch things.

Best Answer

Oh god no, do not use the SQL Server Destination.

At this point, the negligible performance gains are not worth the random error it may throw while loading data. OLE DB Destination is rock solid, provides excellent performance and doesn't suffer from the infernal errors you can run into with the SQL Server Destination. Both components can store data into SQL Server and as a bonus, the OLE DB Destination allows you to run your SSIS package from any server.

SQL Server Destination works by some very clever voodoo, as I understand it. Basically, the SSIS engine lays the data out as it will sit inside SQL Server and whoosh, it gets sucked in with a minimal amount of checking on the database side, since SSIS super promises, double pinky swears, that the data is good.
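You can see the same "trust me" flavor of loading from plain T-SQL. This is a loose analogy rather than what the component literally does under the hood, and the table and file path below are made up for illustration; the point is that a bulk load skips CHECK constraints and triggers by default unless you explicitly opt back in:

    -- Hypothetical staging table and file path, purely for illustration.
    -- By default a bulk load does not check CHECK constraints or fire triggers;
    -- you have to opt back in with CHECK_CONSTRAINTS / FIRE_TRIGGERS.
    BULK INSERT dbo.StagingOrders
    FROM 'C:\etl\orders.csv'
    WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        TABLOCK                  -- table lock, allows a minimally logged load when the recovery model permits
        -- , CHECK_CONSTRAINTS   -- re-enable constraint checking
        -- , FIRE_TRIGGERS       -- fire insert triggers
    );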

The issue that plagued many users in the 2005/2008 days was that something would happen between thinking about inserting the data and actually inserting it, and the only reported error would be something exceptionally insightful like "The destination component reported an error." However, if you immediately restarted the package, with no changes to the source or destination database, the data would load just fine.

As for the architect-level questions: if you design your packages using the OLE DB Destination, you can be infinitely more agile than with the SQL Server Destination. What is the one thing SQL Server wants more than anything else for good performance? Memory. The more information it can keep in RAM, the better your performance will be. How long does SQL Server keep that memory? Until it reboots, or until things get really bad and it is forced to give some back lest the server topple over.

SQL Server Integration Services is a high-performance ETL engine. It can transform data very quickly by keeping it all in memory instead of writing it to temporary files or tables. The more rows of data it can hold in memory, the more concurrent changes it can process, and the higher your throughput. The general pattern for SSIS resource consumption, though, is that it's bursty. Sure, it might look like Hungry Hungry Hippos as it starts up, consuming everything it sees, but it runs, maybe for minutes, maybe for hours, and as soon as it's done, all of those resources are freed and available for general processing.

Since SSIS is an in-memory ETL tool and SQL Server really likes memory, they're gonna fight. Maybe not today. The server has 256 GB of memory and SQL Server is only using 180 GB, so there's a good 30% left for OS-level processes, and SSIS plays nicely in that space.
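If you're curious where that split actually sits on your box, a couple of DMVs will show it (this is a generic check, nothing specific to the scenario above):

    -- How much physical memory is the database engine holding right now?
    SELECT physical_memory_in_use_kb / 1024 AS sql_memory_in_use_mb,
           memory_utilization_percentage
    FROM sys.dm_os_process_memory;

    -- How much does the box have in total, and how much is still free for the OS and SSIS?
    SELECT total_physical_memory_kb / 1024 AS total_physical_memory_mb,
           available_physical_memory_kb / 1024 AS available_physical_memory_mb
    FROM sys.dm_os_sys_memory;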

Time passes. Oh, the business is growing, we have new web thingers beating on the database all day long, and now all these analysts are going against the server with Access, pulling all the data just to filter it for "today" once it gets to their desktops. Now SQL Server is asking for more and more memory. You can change the max server memory setting in SQL Server to give a little more to the database engine, but at a cost to other OS tasks, like SSIS (which doesn't run in SQL Server's memory space).
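For reference, that knob is just an sp_configure setting; the 180 GB figure below is hypothetical and should be sized to your own workload:

    -- Cap the database engine so the OS, SSIS and friends keep some headroom.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 184320;  -- 180 GB, hypothetical value
    RECONFIGURE;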

At some point, you might run into a situation where you've maxed out the server's physical memory, trimmed down everything running on that box that isn't SQL Server, and the only thing left is SSIS. So now what?

In an OLE DB Destination world, you stand up another instance (or use an existing, less taxed box) and deploy/move your packages from server 1 to server 2. As long as the permissions and any file system dependencies exist on the new box, you're done.

In a SQL Server Destination world, you have to edit every package, change the destination component, test, go through whatever change management review process you have, and then deploy.

Something is going to be the limiting factor for your SSIS performance: memory, network, storage, or CPU. If you have packages pulling terabytes of data across the wire only to write kilobytes of data, then yes, that subset of packages might need to run as close to the source as possible.

Novice SSIS developers can write some really memory-intensive packages. For example, they need to bring together customers from server1.database1 and their orders, also in server1.database1. If they just use the tools, they might drop in two OLE DB Source components that bring all the customers and all the orders into memory. To join the data, they would then need a Merge Join component. Oh, but that has a requirement for sorted input. No worries, there's a handy dandy Sort component right there in the toolbox, so we add one of those to each of our sources and then wire up the join. At this point, we've pulled all of that data into memory, sorted it all in memory, and then halved our available memory into the "before the asynchronous/blocking component" and "after the blocking component" buckets.

No problem: they rewrite it and realize they can add an explicit ORDER BY to the source query, mark the source output as sorted, and eliminate the two Sorts, but they still have to pay the penalty for the join.
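Concretely, "mark the source output as sorted" means doing the ordering in the source query and then telling SSIS about it, because the data flow cannot see the ORDER BY on its own. The table and column names here are invented for the example:

    -- Query for the OLE DB Source: let the database engine do the ordering.
    -- Then, in the source's Advanced Editor, set IsSorted = True on the output
    -- and SortKeyPosition = 1 on CustomerID so the Merge Join trusts the order.
    SELECT CustomerID, CustomerName
    FROM dbo.Customer
    ORDER BY CustomerID;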

Or someone who actually writes queries sees what they're doing and says, "Let's use a single query to get this data. Let the database engine use its indexes to make this efficient."
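A minimal sketch of that single-query version, with invented table names: the join happens inside the database engine, which can use its indexes, and SSIS just streams the already-joined rows through one source.

    -- One source component, one query: no Sort, no Merge Join, no blocking buffers.
    SELECT c.CustomerID,
           c.CustomerName,
           o.OrderID,
           o.OrderDate,
           o.OrderTotal
    FROM dbo.Customer AS c
    INNER JOIN dbo.CustomerOrder AS o
        ON o.CustomerID = c.CustomerID;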