I’m busy working on performance optimisation in my SSIS packages and have already made a number of changes that proved very useful. Now I’m at the point where I load my fact table. I have a series of Lookup components that process more than 200 million rows. I was wondering if it is possible to split this data so that the lookup work can be done in parallel. Below are the two options I’m considering; I’m not sure which one is best, or whether they will work at all.
- To use a single Data Flow Task (DFT) for my lookups. In that DFT I will have a single OLE DB source component, followed by a Conditional Split transformation that splits the rows into three partitions based on date (months). From that point on, each branch will have the same transformation components, and all three branches will load into the same destination table, each through its own destination component (see the first sketch after this list).
- To use three separate Data Flow Tasks, with partitioning done at the source query level. Everything else will be identical except for the WHERE clause in each source query, which specifies the partition (see the second sketch below).
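For option 1, something like this is what I have in mind for the Conditional Split conditions, written in SSIS expression syntax. [OrderDate] is just a placeholder for my actual date column, and the month boundaries are only an example:

```
Output "Jan-Apr":  MONTH([OrderDate]) <= 4
Output "May-Aug":  MONTH([OrderDate]) > 4 && MONTH([OrderDate]) <= 8
Output "Sep-Dec":  (default output, i.e. everything else)
```

Each of the three outputs would then feed its own copy of the Lookup components and its own destination component, all pointing at the same fact table.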
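For option 2, the three source queries would be identical apart from the WHERE clause, along these lines (the table name, column names, and date ranges are placeholders, not my actual schema):

```sql
-- DFT 1 source query
SELECT FactKey, OrderDate, Amount
FROM dbo.StagingFact
WHERE OrderDate >= '20230101' AND OrderDate < '20230501';

-- DFT 2 source query: same SELECT, but
-- WHERE OrderDate >= '20230501' AND OrderDate < '20230901'

-- DFT 3 source query: same SELECT, but
-- WHERE OrderDate >= '20230901' AND OrderDate < '20240101'
```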
I would like the option that makes the most of parallelism. Reading around, I have come across factors such as buffers, threads, and execution trees that should be considered in these circumstances. Unfortunately I’m not very familiar with those things, so I have decided to post this question for advice.
Many thanks,
Mpumelelo