Programming4us
         
 
 
SQL Server

Microsoft SQL Server 2008 Analysis Services : Managing processing - Partition processing & Managing processing with Integration Services

6/21/2012 4:32:40 PM
During development we can process cubes or dimensions whenever we want; in a production environment processing needs more thought though. On a very simple project we might be able to get away with doing a full process on our entire Analysis Services database if it only takes a few minutes to complete. However, we usually have large data volumes to manage and a limited amount of time to perform any processing, so a full process simply isn't feasible—we need to think carefully how we can only process the data that needs to be processed, and do that processing in the most efficient way possible.

Analysis Services processing can be broken down into two different tasks:

  • Dimension Processing involves loading data into a dimension and building indexes on it.

  • Partition Processing is more complex. Before we can query a cube, all of the dimensions in the cube need to be processed and we need to process the cube itself too. A cube is made up of measure groups and a measure group is made up of partitions, and since partitions are where the data in a cube is actually stored then when we talk about processing a cube what we are really talking about is processing all of the partitions in a cube. Processing a partition involves loading data into it, building indexes, and building any aggregations specified in the aggregation design associated with this partition.

MOLAP, HOLAP, or ROLAP?

So far we've assumed that we're always going to use MOLAP storage mode for all of our partitions. That's because 99% of the time we will be! MOLAP storage involves storing all data in Analysis Services' own data structures; it gives us the fastest query performance but involves the longest processing times. ROLAP storage leaves all data in the relational database, and when MDX queries are run then Analysis Services generates SQL statements to retrieve the data it needs; querying is relatively slow as a result but processing is very fast. HOLAP storage is a cross between the two and involves storing aggregations and indexes in MOLAP mode but everything else in ROLAP mode.

ROLAP storage for partitions is only very rarely useful, for when we really do need to see real-time data inside a cube, and even then the poor query performance of ROLAP partitions means it can only be used with relatively small data volumes. Since fast query performance is the number one priority for any Analysis Services cube the extra time needed for processing MOLAP partitions is usually a price worth paying. We have never seen a scenario where HOLAP storage is the right choice, so we can safely ignore it.

Similarly, MOLAP storage is almost always the right choice for dimensions. 


Dimension processing

In this section we'll discuss the various different types of processing that can be performed on a dimension and what they actually do. For obvious reasons we'll want to choose the option that loads the data we need into the dimension in the quickest and most efficient way. However, when processing a dimension we also need to be aware of the side effects that this processing might have on any cubes that the dimension is present in. If, for example, we perform a Process Update on a dimension, then this might invalidate aggregations that include hierarchies from this dimension, meaning they need to be rebuilt too. In this case and others, the side effects are more significant than dimension processing itself: the processing of a single dimension is usually a relatively fast operation while rebuilding aggregations can take much longer.

Clicking on the Impact Analysis button in the Process dialog in either SQL Management Studio or BI Development Studio will list all of the objects that are affected by any type of processing on any object.


Here is a complete list of all of the options we have for dimension processing, along with a brief description of what each one does and any side effects:

Type Description
Full The dimension's structure is deleted and rebuilt, and all dimension data is reloaded from scratch. In addition, all measure groups related to this dimension, and all partitions in these measure groups, require a Full process before they can be queried.
Add Compares the contents of the relational dimension table with the contents of the Analysis Services dimension, adds any new members that are found but does not delete or update existing members: this means that you run the risk that important changes to existing dimension members are not made. It does not invalidate existing partition data or aggregations. Also known as incremental processing, this option is not visible in SQL Management Studio or BI Development Studio. For more information on how to use process Add, see http://tinyurl.com/gregprocessadd.
Data Loads data into the dimension but does not process dimension indexes. It has the same side effects as Process Full.
Index Usually used after Process Data, it builds the bitmap indexes of the dimension and does not require any access to the relational database.
Type Description
Update Compares the contents of the relational dimension table with the contents of the Analysis Services dimension and then deletes any members from the Analysis Services dimension that no longer exist in the relational table, updates any members that have changed and adds any new members.

Since existing members on hierarchies with at least one flexible attribute relationship in the chain of attribute relationships down to the key attribute can change, then all aggregations including these hierarchies will be dropped. They can be rebuilt later by running a Process Default on affected cubes or measure groups.

Over time the performance of a dimension will degrade after multiple Process Updates. As a result, if you do use Process Update regularly it's a good idea to schedule an occasional Process Full, for example at a weekend or public holiday when it will cause minimal disruption.
Unprocess Deletes all structures and data for the dimension.
Default Performs whatever processing is necessary to bring the dimension up to a fully processed state.

With these options in mind, here, in order of increasing complexity, are some of the possible strategies we can use for processing dimensions:

  • We can run a Process Full on all of our dimensions, which of course means we then have to run a Process Full on all of our cubes. As we have already said this is the simplest option but is only feasible if we have enough time to do this—and can be sure that we will have enough time in the future as the data volumes in our data warehouse grow. Remember that adding hardware can have a dramatic impact on processing performance, and may be cheaper than developing and maintaining a more complex processing solution.

  • We can run a Process Update on any dimensions that have changed their contents, and then run a Process Default on our cube to rebuild any invalidated aggregations.

  • If existing members on dimensions never change we can run a Process Add to add any new members that appear. Although Process Add can be much more difficult to configure than Process Update, it can perform significantly better and in scenarios where we are only making Type 2 changes to a dimension then it is safe to use; the fact that it does not invalidate existing partition data or aggregations is also very important.

  • If existing members do change but we don't mind waiting for these changes to appear in the Analysis Services dimension, then we can run regular Process Adds and then run a Process Update or even a Process Full on a less regular basis. This can, for example, be useful if we need to add new members to a dimension during the day with minimal impact to our users but can wait until our nightly processing window for our dimensions to be completely updated.

For very large dimensions consisting of several million members or a large number of attributes, splitting a Process Full up into separate Process Data and Process Index operations may allow us to make better use of available memory. Whether it does or not will depend on the structure of our dimension and our server's hardware; thorough testing is necessary to determine whether it is worth doing this. Similarly,  setting the AttributeHierarchyOptimizedState and AttributeHierarchyOrdered properties to False may also improve processing performance at the expense of, respectively, query performance and member ordering.

Last of all, remember that many structural changes to a dimension (for example adding a new hierarchy) will mean that a Process Full is necessary. As a result such structural changes will need to be carefully planned for.

Partition processing

As with dimensions, partitions can be processed in several ways. Processing a large partition is a very slow, CPU intensive operation and as a result partition processing normally takes longer than dimension processing; we therefore have to think more carefully about how to process our partitions in the most efficient way. Luckily, partition processing is also a lot less complex than dimension processing in that processing a partition has no side effects on any other objects.

The following table lists the different options for processing a partition:

Type Description
Full The partition is completely rebuilt. All data inside the partitions is deleted and reloaded from the relational database; indexes and aggregations are rebuilt. It is equivalent to an Unprocess, followed by a Process Data, followed by a Process Index.
Add Also known as incremental processing, this adds new data to the partition. After we've specified a table or SQL query containing new fact rows, Analysis Services creates a new partition, fully processes it, and then merges the new partition with the existing one. Over time the performance of a partition will degrade after multiple Process Adds. Once again it's a good idea to schedule an occasional Process Full if we are using Process Add regularly.
Type Description
Data Behaves the same as Process Full but does not process indexes and aggregations.
Index Builds indexes and aggregations for the partition. This is normally used after a Process Data.
Default Does whatever processing is necessary to bring the partition up to a fully processed state. If the partition itself is in an unprocessed state, then a Process Full is performed. If, on the other hand, the partition is processed but some indexes or aggregations inside of it are not processed, then only those indexes and aggregations will be processed.
Unprocess Deletes all structures, data, indexes, and aggregations from the partition.

There are a different set of issues to consider when choosing a partition processing strategy compared to a dimension processing strategy. Dimension data changes naturally, and these changes are something we have to plan for. On the other hand, existing data in our fact tables does not usually change over time. New rows will appear in our fact tables and recently added rows may be updated, but unless older data is incorrect or we need to recalculate measure values based on new business requirements then this older data is normally static.

Let's take a look at some scenarios where the different partition processing options can be used:

  • For new partitions, or where data in existing partitions needs to be completely reloaded, we have to do a Process Full. The important thing here is to set the slices for our partitions so that when we do need to do a Process Full we do the minimum amount of processing necessary. For example, if new data is loaded into our fact tables every month, then every month we can create a new Analysis Services partition to hold that new month's data and run a Process Full on just that new partition. Assuming that the data for previous months does not change, no other partitions need to be processed.

  • Process Add is normally only used with Standard Edition of Analysis Services, where we are only allowed to have one partition per measure group and so strategies such as the one described in the previous bullet are not possible. With only one partition to store all of our data, a Process Full will involve reloading all of the data from our fact table whether it is new or not and could take a long time. Using Process Add to load only new rows is going to take much less time, even if we still need to schedule an occasional Process Full as noted above.

  • If our partitions are very large, then processing data and building aggregations might require a great deal of memory. It might be better to perform a Process Data simultaneously for all the partitions that need to be processed and then, when this has finished and the memory has been released, we can run a Process Index on them. In this way we can separate the two processing steps and reduce memory pressure on the server while also having greater control over when load is placed on the relational database.

    It is very difficult to say whether this approach will benefit processing performance—thorough testing is necessary. If memory is not an issue and we have a powerful server, then processing both data and indexes at the same time might increase parallelism because when the CPUs are waiting on the I/O operations involved in a Process Data, they can be used to build aggregations and indexes instead.

  • If dimensions have been updated using Process Update, then some aggregations on existing, processed partitions may no longer be valid and will need to be rebuilt. In this case running a Process Default on all these partitions will rebuild only the invalid aggregations. Therefore if we are using Process Update and loading data into newly-created partitions at the same time, then our workflow needs to be as follows:

    • Run a Process Update on all dimensions that need it.

    • Build any new partitions that are necessary and perform any other partition management tasks.

    • Run a Process Default on all the partitions in the cube. This will have the same effect as running a Process Full on the new partitions, and rebuild any invalid aggregations on the existing partitions.

In order to be able to query a cube we also need to run a Process Structure on it, which usually takes just a few seconds. This can be run the first time a cube is deployed to production and will only be necessary afterwards if the cube's structure is changed. If we do this and do not process any partitions then the cube can be queried but will contain no data; we can then process partitions individually and as each partition finishes processing its data will appear in the cube. This behavior can be useful in a development environment if we want an exact copy of our production cube but don't want to load all of the data into it. In this case we can run a Process Structure on the cube and then a Process Full on just one or two partitions.

Processing an object will lead to the Storage Engine caches related to that object and the entire Formula Engine cache being cleared: if the data in the cube has changed then any data in the cache may no longer be valid. This is especially important if we have to perform any kind of processing during the day because query performance will suffer as a result, and the more often we process the less benefit we will get from caching.


Lazy Aggregations

Lazy Aggregations is a processing option for partitions. Normally when we run a Process Full on a partition it can be queried when both the Process Data and Process Index steps have completed. If we want the partition to be available sooner, we can set the Processing Mode property on the partition to Lazy Aggregations. This means that the partition becomes available as soon as the Process Data step has completed; the Process Index step will then be run in the background as and when resources are available. The price we pay for having the partition available earlier is that, while aggregations and indexes are not built, queries against the partition will perform worse.

Using the Lazy Aggregations option we lose control over when the aggregations and indexes for a specific partition will be built because it will depend upon several factors, most of which are out of our control. Also Lazy Aggregations interfere with our ability to synchronize a cube, as we will see later. As a result we do not recommend the use of this option.

Processing reference dimensions

Materialized reference relationships result in a join being made between the fact table and the intermediate dimension table in the SQL generated for partition processing. This highlights the important point that materialized reference relationships are resolved during partition processing and not dimension processing.

Apart from the negative impact that materialized relationships have on partition processing performance, using them also carries a less obvious but more serious penalty: if the intermediate dimension in the relationship ever changes its structure, we have to run a Process Full on all of our partitions.

Where a Region dimension joins to a measure group through a Country attribute on a Customer dimension using a referenced relationship. If that relationship is materialized, then during partition processing the Customer dimension table will be joined to the fact table and the key of the Country that each Customer lives in will be stored in the partition. But what happens if we run a Process Update on the Customer dimension and Customers change the Country they live in? The materialized referenced relationship data on existing partitions will not be refreshed, so Region values for these partitions will be incorrect. Even worse, if we then create new partitions and run a Process Full on them these new partitions will contain the correct materialized relationship data, so Region values will be completely inconsistent. Only a Process Full on all partitions can ensure that queries using the Region dimension will return correct data.

This problem does not occur if the referenced relationship is not materialized. As a result, this is one more good reason not to use materialized referenced relationships.

Handling processing errors

In an ideal world Analysis Services processing would never fail. Certainly, we should always try to ensure that all data integrity issues are handled during the ETL for our relational data warehouse rather than in Analysis Services. However, in the real world there will always be some problems with the data in our data warehouse, and we need to configure Analysis Services processing to handle them.

What happens for each Analysis Services object when there is a processing error is controlled by its ErrorConfiguration property. The default setting is for processing to fail whenever an error occurs, but we can override this for different types of error and either ignore the error completely, ignore anything up a given number or errors and then fail processing, or log the error to a text file and continue processing. We also have two options that govern what happens when we ignore an error: we can either ignore the problematic fact table or dimension table row completely or assign the values from them to a special, automatically generated member called the Unknown Member on a hierarchy. The existence and visibility of unknown members is controlled by the UnknownMember property of a dimension: it can either be set to None, Visible, or Hidden.

We can also set these properties for each processing operation by clicking on the Change Settings button in the Process dialog, and then going to the Dimension Key Errors tab.

We recommend handling processing errors by setting the ErrorConfiguration property on objects rather than having to remember to set them every time we do processing. Discarding rows when they contain errors is probably a bad thing to do since it means our cube could potentially contain incorrect values; it is usually better to assign these values to a visible Unknown Member. This way even if we can't see these values assigned correctly for a particular hierarchy the totals seen at the All Member will still be correct and, when browsing the cube, we'll be able to see where exactly these errors are occurring. Remember, though, this use of Unknown Members is purely an insurance policy in case errors get past the checks we have implemented in our ETL.

We probably don't want processing to completely fail either if an error is encountered; it's better to let processing continue and then reprocess objects later once we have fixed the data problems. Therefore we should ignore the number of errors generated, use the Report and Continue option for each type of error, and then log these errors to a text file.

Managing processing with Integration Services

Just as we used Integration Services to manage our partitions, we can also use it to manage our dimension and partition processing very easily. Using Integration Services gives us the following benefits:

  • It gives us complete control over the order in which objects are processed and the type of processing used. It's very easy to implement even very complex processing strategies using the Integration Services control flow.

  • Using Integration Services' own logging features (which we should already be using in our ETL) we will be able to record exactly what gets processed, how long each step takes, and any errors that are raised.

  • If processing fails, for whatever reason, Integration Services makes it very easy to do things such as send an email to the cube administrator to tell them that this has happened.

  • As we have seen, in Integration Services we can use a script task to automatically manage our partitions using AMO. As a result it makes sense to manage partition processing in the same place.

There are four main methods we can use to manage processing within Integration Services:

  • Using the Analysis Services Processing Task is our recommended method. It's very easy to configure and allows us to process multiple objects either sequentially or in parallel, in a single batch. However, there may be scenarios where we might not choose to use it. For example, the list of objects that we can set the task to process is static and cannot be changed, even with Integration Services expressions. In some cases we might only want to run a Process Update on a large dimension if we know its structure has changed; we could certainly build this kind of conditional logic into the control flow and have one Analysis Serviced Processing Task for each dimension, but there may well be easier ways of doing this such as writing AMO code.

  • We can also execute XMLA processing commands directly using the Execute Analysis Services DDL Task. This gives us complete flexibility with our processing options, but XMLA commands are not easy for humans to read and therefore they are difficult to maintain. We can generate XMLA processing commands very easily by right-clicking on the object we want to process in the Object Explorer pane in SQL Management Studio, selecting Process, and then in the Process dialog clicking the Script button instead of the OK button.

  • We can write .NET code using AMO to process objects, just as we did to create and delete partitions. While this works usually gives us no particular advantages over the other methods, it will take slightly longer to develop and will be harder to maintain. It's a good option when we have very complex logic determining what needs processing, and it would be too difficult to implement this logic in the control flow.

  • External applications, for example the ASCMD utility described below, that can manage processing can be invoked using the Execute Process Task.

The ASCMD utility is a command-line tool that can execute XMLA, MDX, or DMX commands against Analysis Services. A detailed list of its functionality can be found in its readme here:http://tinyurl.com/ascmdreadme and the Analysis Services 2008 version can be downloaded from here:http://tinyurl.com/ascmddownload. It's extremely useful in situations where we can't use Integration Services, but if we do have a choice then using Integration Services is always the easiest option.


A simple Integration Services package that performs processing might look something like this:

In this example we're running our partition management tasks simultaneously with our dimension processing, where we're running a Process Update on each dimension that may have changed. After both of these operations have completed, we then use another processing task to process, in parallel, all of the measure groups in our cube with a Process Default so that new partitions are fully processed and any existing partitions with invalid aggregations have their aggregations rebuilt.

Push-mode processing

So far, when we've processed dimensions or partitions we've been sending XMLA commands to Analysis Services to initiate this. Analysis Services then generates whatever SQL is necessary and runs it against a relational data source, and loads the data returned by these queries into the object that is being processed. This is known as pull-mode processing: Analysis Services is pulling data from a data source.

There is also another, much less often used, method for processing objects: push-mode processing. In push-mode processing we can use Integration Services to push data from a data flow into a partition or dimension, using a Partition Processing or a Dimension Processing destination. No SQL is generated by Analysis Services in this case—we are pushing data into Analysis Services ourselves.

Push-mode processing is useful when we need to load data into Analysis Services from data sources that are not officially supported and for which Analysis Services would not be able to generate valid SQL (such as text files, or databases like MySQL). It is also an alternative to running a Process Add on a partition, or a Process Add or a Process Update on a dimension, in situations where data needs to be loaded into Analysis Services; however, like Process Add and Process Update, regular use can lead to performance degradation and so running a Process Full occasionally is necessary.

Proactive caching

The proactive caching features in Analysis Services allow processing to be kicked off automatically in response to changes in the relational data. When they were first introduced in Analysis Services 2005 they were very heavily hyped but the truth is that they are almost never used in a production environment. This isn't because there are any major problems with the functionality, but because there are easier ways of handling the requirement for "real-time" data.

In an environment where Analysis Services cubes are built on top of a traditional data warehouse, when we know when data is loaded into the data warehouse, it's easier to schedule Analysis Services processing as part of the overall data warehouse ETL. That way we can kick off processing when we're sure that the fact and dimension tables have finished loading, and have finished successfully. Even if the cube needs to be updated regularly during the day, scheduling that processing at regular intervals using SQL Server Agent and Integration Services means that we can be sure when that processing will take place, that it does not take place too often and therefore impact query performance, and be able to log and handle any errors appropriately.

We only recommend the use of proactive caching on less formal Analysis Services implementations, for example when cubes are built on tables in an OLTP database. In these cases, where there is no ETL or data warehouse in place and where data volumes are relatively small, using proactive caching is often easier than having to develop and maintain an Integration Services package.

Other -----------------
- Protecting SQL Server Data : CELL-LEVEL ENCRYPTION - Views and Stored Procedures (part 2) - Creating the Stored Procedures
- Protecting SQL Server Data : CELL-LEVEL ENCRYPTION - Views and Stored Procedures (part 1) - Creating the View
- Protecting SQL Server Data : Implementing Cell-Level Encryption
- Protecting SQL Server Data : Preparing for Cell-Level Encryption
- Microsoft SQL Server 2008 R2 : Monitoring Replication (part 2) - New and Improved Peer-to-Peer Replication
- Microsoft SQL Server 2008 R2 : Monitoring Replication (part 1) - Replication Monitoring SQL Statements
- Microsoft SQL Server 2008 R2 : Scripting Replication
- Processing and Storing Data in SQL Server 2005 : Data Migration from One Data Store to Another Data Store
- Processing and Storing Data in SQL Server 2005 : Implementing the Record Failure Code
- Processing and Storing Data in SQL Server 2005 : Data Tracking Validation
- Processing and Storing Data in SQL Server 2005 : Updating the FileWorker Class
- Microsoft SQL Server 2008 R2 : Setting Up Replication (part 4) - Creating Subscriptions
- Microsoft SQL Server 2008 R2 : Setting Up Replication (part 3) - Horizontal and Vertical Filtering
- Microsoft SQL Server 2008 R2 : Setting Up Replication (part 2) - Creating a Publication
- Microsoft SQL Server 2008 R2 : Setting Up Replication (part 1) - Creating a Distributor and Enabling Publishing
- SQL Server 2008 R2 : Basing the Replication Design on User Requirements
- SQL Server 2008 R2 : Planning for SQL Server Data Replication & SQL Server Replication Types
- SQL Server 2008 R2 : Replication Agents
- SQL Server 2008 : Replication - Subscriptions
- SQL Server 2008 : Replication Scenarios
 
 
Top 10
- Designing the Right Data Storage Structure for Exchange Server 2010 (part 3)
- Windows Server 2008 : Understanding the Windows AIK (part 5) - Understanding Sysprep
- Windows 7 : Setting Up User Security - Determining Who Is Logged On
- SQL Server 2008 : Viewing and Modifying Data (part 3) - Creating Functions and Creating Triggers
- Security Management in the Cloud - Security Vulnerability, Patch, and Configuration Management (part 2)
- Windows Phone 7 : Searching Marketplace
- Exchange Server 2010 : Designing and Implementing AD RMS Integration (part 3) - Transport and Journal Report Decryption
- Windows Phone 7 : Image and ImageSource
- Windows 7 : Preventing Users from Logging On at Certain Times
- User Interface : Creating an Animated Splash Screen
Most view
- Windows Azure: Building a Secure Backup System (part 6) - Uploading Efficiently Using Blocks
- SharePoint 2010: Change the Home Page of a Site
- SharePoint 2010 : Use Built-in Web Parts (part 4) - Use the Content Query Web Part in SharePoint Server
- Microsoft ASP.NET 3.5 : Web Services for ASP.NET AJAX Applications (part 1) - Remote Calls via Web Services
- Windows Server 2008 : Disaster Scenario Troubleshooting
- Developing Applications for Windows Phone 7 : Visual Grammar
- User Interface : Customizing the Soft Input Panel Keyboard to Accept Only Numbers
- Windows Server 2008 : Deploying Terminal Services Gateway
- Auditing an Existing Site to Identify SEO Problems (part 2) - The Importance of Keyword Reviews
- Navigating the Central Administration Home Page (part 3) - Central Administration Page Option