During development we can process cubes or dimensions
whenever we want; in a production environment processing needs more
thought though. On a very simple project we might be able to get away
with doing a full process on our entire Analysis Services database if it
only takes a few minutes to complete. However, we usually have large
data volumes to manage and a limited amount of time to perform any
processing, so a full process simply isn't feasible—we need to think
carefully how we can only process the data that needs to be processed,
and do that processing in the most efficient way possible.
Analysis Services processing can be broken down into two different tasks:
Dimension Processing involves loading data into a dimension and building indexes on it.
Partition Processing
is more complex. Before we can query a cube, all of the dimensions in
the cube need to be processed and we need to process the cube itself
too. A cube is made up of measure groups and a measure group is made up
of partitions, and since partitions are where the data in a cube is
actually stored then when we talk about processing a cube what we are
really talking about is processing all of the partitions in a cube.
Processing a partition involves loading data into it, building indexes,
and building any aggregations specified in the aggregation design
associated with this partition.
MOLAP, HOLAP, or ROLAP?
So far we've assumed that
we're always going to use MOLAP storage mode for all of our partitions.
That's because 99% of the time we will be! MOLAP
storage involves storing all data in Analysis Services' own data
structures; it gives us the fastest query performance but involves the
longest processing times. ROLAP
storage leaves all data in the relational database, and when MDX
queries are run then Analysis Services generates SQL statements to
retrieve the data it needs; querying is relatively slow as a result but
processing is very fast. HOLAP
storage is a cross between the two and involves storing aggregations
and indexes in MOLAP mode but everything else in ROLAP mode.
ROLAP storage for partitions
is only very rarely useful, for when we really do need to see real-time
data inside a cube, and even then the poor query performance of ROLAP
partitions means it can only be used with relatively small data volumes.
Since fast query performance is the number one priority for any
Analysis Services cube the extra time needed for processing MOLAP
partitions is usually a price worth paying. We have never seen a
scenario where HOLAP storage is the right choice, so we can safely
ignore it.
Similarly, MOLAP storage is almost always the right choice for dimensions.
Dimension processing
In this section we'll
discuss the various different types of processing that can be performed
on a dimension and what they actually do. For obvious reasons we'll want
to choose the option that loads the data we need into the dimension in
the quickest and most efficient way. However, when processing a
dimension we also need to be aware of the side effects that this
processing might have on any cubes that the dimension is present in. If,
for example, we perform a Process Update on a dimension, then this
might invalidate aggregations that include hierarchies from this
dimension, meaning they need to be rebuilt too. In this case and others,
the side effects are more significant than dimension processing itself:
the processing of a single dimension is usually a relatively fast
operation while rebuilding aggregations can take much longer.
Clicking on the Impact Analysis button in the Process
dialog in either SQL Management Studio or BI Development Studio will
list all of the objects that are affected by any type of processing on
any object.
Here is a complete list of all
of the options we have for dimension processing, along with a brief
description of what each one does and any side effects:
|
Type
|
Description
|
|---|
|
Full
|
The dimension's structure is deleted and rebuilt, and all dimension data
is reloaded from scratch. In addition, all measure groups related to
this dimension, and all partitions in these measure groups, require a
Full process before they can be queried.
|
|
Add
|
Compares the contents of the relational dimension table with the
contents of the Analysis Services dimension, adds any new members that
are found but does not delete or update existing members: this means
that you run the risk that important changes to existing dimension
members are not made. It does not invalidate existing partition data or
aggregations.
Also known as incremental processing, this option is not visible in SQL
Management Studio or BI Development Studio. For more information on how
to use process Add, see http://tinyurl.com/gregprocessadd.
|
|
Data
|
Loads data into the dimension but does not process dimension indexes. It has the same side effects as Process Full.
|
|
Index
|
Usually used after Process Data, it builds the bitmap indexes of the
dimension and does not require any access to the relational database.
|
|
Type
|
Description
|
|
Update
|
Compares the contents of the relational dimension table with the
contents of the Analysis Services dimension and then deletes any members
from the Analysis Services dimension that no longer exist in the
relational table, updates any members that have changed and adds any new
members.
Since existing members on
hierarchies with at least one flexible attribute relationship in the
chain of attribute relationships down to the key attribute can change,
then all aggregations including these hierarchies will be dropped. They
can be rebuilt later by running a Process Default on affected cubes or
measure groups.
Over time the performance of a dimension will degrade after multiple
Process Updates. As a result, if you do use Process Update regularly
it's a good idea to schedule an occasional Process Full, for example at a
weekend or public holiday when it will cause minimal disruption.
|
|
Unprocess
|
Deletes all structures and data for the dimension.
|
|
Default
|
Performs whatever processing is necessary to bring the dimension up to a fully processed state.
|
With these options in mind,
here, in order of increasing complexity, are some of the possible
strategies we can use for processing dimensions:
We can run a
Process Full on all of our dimensions, which of course means we then
have to run a Process Full on all of our cubes. As we have already said
this is the simplest option but is only feasible if we have enough time
to do this—and can be sure that we will have enough time in the future
as the data volumes in our data warehouse grow. Remember that adding
hardware can have a dramatic impact on processing performance, and may
be cheaper than developing and maintaining a more complex processing
solution.
We can run
a Process Update on any dimensions that have changed their contents,
and then run a Process Default on our cube to rebuild any invalidated
aggregations.
If
existing members on dimensions never change we can run a Process Add to
add any new members that appear. Although Process Add can be much more
difficult to configure than Process Update, it can perform significantly
better and in scenarios where we are only making Type 2 changes to a
dimension then it is safe to use; the fact that it does not invalidate
existing partition data or aggregations is also very important.
If
existing members do change but we don't mind waiting for these changes
to appear in the Analysis Services dimension, then we can run regular
Process Adds and then run a Process Update or even a Process Full on a
less regular basis. This can, for example, be useful if we need to add
new members to a dimension during the day with minimal impact to our
users but can wait until our nightly processing window for our
dimensions to be completely updated.
For very large dimensions
consisting of several million members or a large number of attributes,
splitting a Process Full up into separate Process Data and Process Index
operations may allow us to make better use of available memory. Whether
it does or not will depend on the structure of our dimension and our
server's hardware; thorough testing is necessary to determine whether it
is worth doing this. Similarly, setting the AttributeHierarchyOptimizedState and AttributeHierarchyOrdered properties to False may also improve processing performance at the expense of, respectively, query performance and member ordering.
Last of all, remember that many
structural changes to a dimension (for example adding a new hierarchy)
will mean that a Process Full is necessary. As a result such structural
changes will need to be carefully planned for.
Partition processing
As with dimensions, partitions
can be processed in several ways. Processing a large partition is a very
slow, CPU intensive operation and as a result partition processing
normally takes longer than dimension processing; we therefore have to
think more carefully about how to process our partitions in the most
efficient way. Luckily, partition processing is also a lot less complex
than dimension processing in that processing a partition has no side
effects on any other objects.
The following table lists the different options for processing a partition:
|
Type
|
Description
|
|---|
|
Full
|
The partition is completely rebuilt. All data inside the partitions is
deleted and reloaded from the relational database; indexes and
aggregations are rebuilt. It is equivalent to an Unprocess, followed by a
Process Data, followed by a Process Index.
|
|
Add
|
Also known as incremental processing, this adds new data to the
partition. After we've specified a table or SQL query containing new
fact rows, Analysis Services creates a new partition, fully processes
it, and then merges the new partition with the existing one.
Over time the performance of a partition will degrade after multiple
Process Adds. Once again it's a good idea to schedule an occasional
Process Full if we are using Process Add regularly.
|
|
Type
|
Description
|
|
Data
|
Behaves the same as Process Full but does not process indexes and aggregations.
|
|
Index
|
Builds indexes and aggregations for the partition. This is normally used after a Process Data.
|
|
Default
|
Does whatever processing is necessary to bring the partition up to a
fully processed state. If the partition itself is in an unprocessed
state, then a Process Full is performed. If, on the other hand, the
partition is processed but some indexes or aggregations inside of it are
not processed, then only those indexes and aggregations will be
processed.
|
|
Unprocess
|
Deletes all structures, data, indexes, and aggregations from the partition.
|
There are a different set
of issues to consider when choosing a partition processing strategy
compared to a dimension processing strategy. Dimension data changes
naturally, and these changes are something we have to plan for. On the
other hand, existing data in our fact tables does not usually change
over time. New rows will appear in our fact tables and recently added
rows may be updated, but unless older data is incorrect or we need to
recalculate measure values based on new business requirements then this
older data is normally static.
Let's take a look at some scenarios where the different partition processing options can be used:
For new partitions,
or where data in existing partitions needs to be completely reloaded, we
have to do a Process Full. The important thing here is to set the
slices for our partitions so that when we do need to do a Process Full
we do the minimum amount of processing necessary. For example, if new
data is loaded into our fact tables every month, then every month we can
create a new Analysis Services partition to hold that new month's data
and run a Process Full on just that new partition. Assuming that the
data for previous months does not change, no other partitions need to be
processed.
Process
Add is normally only used with Standard Edition of Analysis Services,
where we are only allowed to have one partition per measure group and so
strategies such as the one described in the previous bullet are not
possible. With only one partition to store all of our data, a Process
Full will involve reloading all of the data from our fact table whether
it is new or not and could take a long time. Using Process Add to load
only new rows is going to take much less time, even if we still need to
schedule an occasional Process Full as noted above.
If our
partitions are very large, then processing data and building
aggregations might require a great deal of memory. It might be better to
perform a Process Data simultaneously for all the partitions that need
to be processed and then, when this has finished and the memory has been
released, we can run a Process Index on them. In this way we can
separate the two processing steps and reduce memory pressure on the
server while also having greater control over when load is placed on the
relational database.
It is
very difficult to say whether this approach will benefit processing
performance—thorough testing is necessary. If memory is not an issue and
we have a powerful server, then processing both data and indexes at the
same time might increase parallelism because when the CPUs are waiting
on the I/O operations involved in a Process Data, they can be used to
build aggregations and indexes instead.
If
dimensions have been updated using Process Update, then some
aggregations on existing, processed partitions may no longer be valid
and will need to be rebuilt. In this case running a Process Default on
all these partitions will rebuild only the invalid aggregations.
Therefore if we are using Process Update and loading data into
newly-created partitions at the same time, then our workflow needs to be
as follows:
Run a Process Update on all dimensions that need it.
Build any new partitions that are necessary and perform any other partition management tasks.
Run
a Process Default on all the partitions in the cube. This will have the
same effect as running a Process Full on the new partitions, and
rebuild any invalid aggregations on the existing partitions.
In order to be able to query a
cube we also need to run a Process Structure on it, which usually takes
just a few seconds. This can be run the first time a cube is deployed to
production and will only be necessary afterwards if the cube's
structure is changed. If we do this and do not process any partitions
then the cube can be queried but will contain no data; we can then
process partitions individually and as each partition finishes
processing its data will appear in the cube. This behavior can be useful
in a development environment if we want an exact copy of our production
cube but don't want to load all of the data into it. In this case we
can run a Process Structure on the cube and then a Process Full on just
one or two partitions.
Processing an object
will lead to the Storage Engine caches related to that object and the
entire Formula Engine cache being cleared: if the data in the cube has
changed then any data in the cache may no longer be valid. This is
especially important if we have to perform any kind of processing during
the day because query performance will suffer as a result, and the more
often we process the less benefit we will get from caching.
Lazy Aggregations
Lazy Aggregations
is a processing option for partitions. Normally when we run a Process
Full on a partition it can be queried when both the Process Data and
Process Index steps have completed. If we want the partition to be
available sooner, we can set the Processing Mode property on the partition to Lazy Aggregations.
This means that the partition becomes available as soon as the Process
Data step has completed; the Process Index step will then be run in the
background as and when resources are available. The price we pay for
having the partition available earlier is that, while aggregations and
indexes are not built, queries against the partition will perform worse.
Using the Lazy Aggregations
option we lose control over when the aggregations and indexes for a
specific partition will be built because it will depend upon several
factors, most of which are out of our control. Also Lazy Aggregations
interfere with our ability to synchronize a cube, as we will see later.
As a result we do not recommend the use of this option.
Processing reference dimensions
Materialized reference relationships result in a join
being made between the fact table and the intermediate dimension table
in the SQL generated for partition processing. This highlights the
important point that materialized reference relationships are resolved
during partition processing and not dimension processing.
Apart from
the negative impact that materialized relationships have on partition
processing performance, using them also carries a less obvious but more
serious penalty: if the intermediate dimension in the relationship ever
changes its structure, we have to run a Process Full on all of our
partitions.
Where a Region dimension joins to a measure group through a Country attribute on a Customer dimension using a referenced relationship. If that relationship is materialized, then during partition processing the Customer
dimension table will be joined to the fact table and the key of the
Country that each Customer lives in will be stored in the partition. But
what happens if we run a Process Update on the Customer
dimension and Customers change the Country they live in? The
materialized referenced relationship data on existing partitions will
not be refreshed, so Region values for these partitions will be
incorrect. Even worse, if we then create new partitions and run a
Process Full on them these new partitions will contain the correct
materialized relationship data, so Region values will be completely
inconsistent. Only a Process Full on all partitions can ensure that
queries using the Region dimension will return correct data.
This problem does not occur if
the referenced relationship is not materialized. As a result, this is
one more good reason not to use materialized referenced relationships.
Handling processing errors
In an ideal world Analysis Services processing would never fail. Certainly,
we should always try to ensure that all data integrity issues are
handled during the ETL for our relational data warehouse rather than in
Analysis Services. However, in the real world there will always be some
problems with the data in our data warehouse, and we need to configure
Analysis Services processing to handle them.
What happens for each Analysis Services object when there is a processing error is controlled by its ErrorConfiguration
property. The default setting is for processing to fail whenever an
error occurs, but we can override this for different types of error and
either ignore the error completely, ignore anything up a given number or
errors and then fail processing, or log the error to a text file and
continue processing. We also have two options that govern what happens
when we ignore an error: we can either ignore the problematic fact table
or dimension table row completely or assign the values from them to a
special, automatically generated member called the Unknown Member on a
hierarchy. The existence and visibility of unknown members is controlled
by the UnknownMember property of a dimension: it can either be set to None, Visible, or Hidden.
We can also set these properties for each processing operation by clicking on the Change Settings button in the Process dialog, and then going to the Dimension Key Errors tab.
We recommend handling processing errors by setting the ErrorConfiguration
property on objects rather than having to remember to set them every
time we do processing. Discarding rows when they contain errors is
probably a bad thing to do since it means our cube could potentially
contain incorrect values; it is usually better to assign these values to
a visible Unknown Member. This way even if we can't see these values
assigned correctly for a particular hierarchy the totals seen at the All
Member will still be correct and, when browsing the cube, we'll be able
to see where exactly these errors are occurring. Remember, though, this
use of Unknown Members is purely an insurance policy in case errors get
past the checks we have implemented in our ETL.
We probably don't want
processing to completely fail either if an error is encountered; it's
better to let processing continue and then reprocess objects later once
we have fixed the data problems. Therefore we should ignore the number
of errors generated, use the Report and Continue option for each type of error, and then log these errors to a text file.
Managing processing with Integration Services
Just as we used Integration
Services to manage our partitions, we can also use it to manage our
dimension and partition processing very easily. Using Integration
Services gives us the following benefits:
It gives us
complete control over the order in which objects are processed and the
type of processing used. It's very easy to implement even very complex
processing strategies using the Integration Services control flow.
Using
Integration Services' own logging features (which we should already be
using in our ETL) we will be able to record exactly what gets processed,
how long each step takes, and any errors that are raised.
If
processing fails, for whatever reason, Integration Services makes it
very easy to do things such as send an email to the cube administrator
to tell them that this has happened.
As
we have seen, in Integration Services we can use a script task to
automatically manage our partitions using AMO. As a result it makes
sense to manage partition processing in the same place.
There are four main methods we can use to manage processing within Integration Services:
Using the
Analysis Services Processing Task is our recommended method. It's very
easy to configure and allows us to process multiple objects either
sequentially or in parallel, in a single batch. However, there may be
scenarios where we might not choose to use it. For example, the list of
objects that we can set the task to process is static and cannot be
changed, even with Integration Services expressions. In some cases we
might only want to run a Process Update on a large dimension if we know
its structure has changed; we could certainly build this kind of
conditional logic into the control flow and have one Analysis Serviced
Processing Task for each dimension, but there may well be easier ways of
doing this such as writing AMO code.
We
can also execute XMLA processing commands directly using the Execute
Analysis Services DDL Task. This gives us complete flexibility with our
processing options, but XMLA commands are not easy for humans to read
and therefore they are difficult to maintain. We can generate XMLA
processing commands very easily by right-clicking on the object we want
to process in the Object Explorer pane in SQL Management Studio, selecting Process, and then in the Process dialog clicking the Script button instead of the OK button.
We can write .NET
code using AMO to process objects, just as we did to create and delete
partitions. While this works usually gives us no particular advantages
over the other methods, it will take slightly longer to develop and will
be harder to maintain. It's a good option when we have very complex
logic determining what needs processing, and it would be too difficult
to implement this logic in the control flow.
External
applications, for example the ASCMD utility described below, that can
manage processing can be invoked using the Execute Process Task.
The ASCMD utility is a
command-line tool that can execute XMLA, MDX, or DMX commands against
Analysis Services. A detailed list of its functionality can be found in
its readme here:http://tinyurl.com/ascmdreadme and the Analysis Services 2008 version can be downloaded from here:http://tinyurl.com/ascmddownload.
It's extremely useful in situations where we can't use Integration
Services, but if we do have a choice then using Integration Services is
always the easiest option.
A simple Integration Services package that performs processing might look something like this:
In this example we're running
our partition management tasks simultaneously with our dimension
processing, where we're running a Process Update on each dimension that
may have changed. After both of these operations have completed, we then
use another processing task to process, in parallel, all of the measure
groups in our cube with a Process Default so that new partitions are
fully processed and any existing partitions with invalid aggregations
have their aggregations rebuilt.
Push-mode processing
So far, when we've
processed dimensions or partitions we've been sending XMLA commands to
Analysis Services to initiate this. Analysis Services then generates
whatever SQL is necessary and runs it against a relational data source,
and loads the data returned by these queries into the object that is
being processed. This is known as pull-mode processing: Analysis Services is pulling data from a data source.
There is also another, much less often used, method for processing objects: push-mode processing. In push-mode
processing we can use Integration Services to push data from a data
flow into a partition or dimension, using a Partition Processing or a
Dimension Processing destination. No SQL is generated by Analysis
Services in this case—we are pushing data into Analysis Services
ourselves.
Push-mode processing is
useful when we need to load data into Analysis Services from data
sources that are not officially supported and for which Analysis
Services would not be able to generate valid SQL (such as text files, or
databases like MySQL). It is also an alternative to running a Process
Add on a partition, or a Process Add or a Process Update on a dimension,
in situations where data needs to be loaded into Analysis Services;
however, like Process Add and Process Update, regular use can lead to
performance degradation and so running a Process Full occasionally is
necessary.
Proactive caching
The proactive caching
features in Analysis Services allow processing to be kicked off
automatically in response to changes in the relational data. When they
were first introduced in Analysis Services 2005 they were very heavily
hyped but the truth is that they are almost never used in a production
environment. This isn't because there are any major problems with the
functionality, but because there are easier ways of handling the
requirement for "real-time" data.
In an environment where
Analysis Services cubes are built on top of a traditional data
warehouse, when we know when data is loaded into the data warehouse,
it's easier to schedule Analysis Services processing as part of the
overall data warehouse ETL. That way we can kick off processing when
we're sure that the fact and dimension tables have finished loading, and
have finished successfully. Even if the cube needs to be updated
regularly during the day, scheduling that processing at regular
intervals using SQL Server Agent and Integration Services means that we
can be sure when that processing will take place, that it does not take
place too often and therefore impact query performance, and be able to
log and handle any errors appropriately.
We only recommend the use of
proactive caching on less formal Analysis Services implementations, for
example when cubes are built on tables in an OLTP database. In these
cases, where there is no ETL or data warehouse in place and where data
volumes are relatively small, using proactive caching is often easier
than having to develop and maintain an Integration Services package.