Thursday, August 27, 2015

Supporting Multiple First Day of the Week Calendars in a Cube

A not uncommon requirement is to support weeks with different start dates. For example, the Finance team might want to see weeks as Tuesday through Monday, while store manager A might prefer to see weeks from Monday through Sunday, and store manager B might prefer to see weeks as Saturday through Friday. Do not create new dimensions for this requirement. That is unnecessary and would lead to degraded query performance.
 
It is quite simple to support multiple week starting dates without complicating your cube or degrading performance. You should already have a Date dimension, with one member per calendar day. That's all you need. Now, just add attributes to each day denoting the week number and day of week for each of the (up to 7) week start days. Just to be clear, the number of records in your Date table/view will not change; you are only adding new columns.
In your existing date dimension table (or view, in my projects), add up to 7 new columns, WeekStartingMonday, WeekStartingTuesday, WeekStartingWednesday, etc., which will contain the WeekId (week number) for the week defined by that start day. For the first (Monday through Sunday) week in the year, those days will all have a WeekStartingMonday key of 1. For the first (Tuesday through Monday) week in the year, those days will all have a WeekStartingTuesday key of 1. To put it another way, each attribute will have a number between 1 and 53 denoting the week number. These 7 attributes will be subtly different, as WeekStartingTuesday will have Tuesday as the first day of the new week, etc. You could give these weeks captions as well, but that is not necessary at this stage. Note, if you want to have full weeks at the beginning and end of each year, you will need to add another 7 attributes, WeekStartingMondayYear, WeekStartingTuesdayYear, etc. These attributes will hold the year that the day/week falls in (ie a number such as 2015, 2016, etc.). Then you can ensure that each year has exactly 52 or 53 full weeks.
In the table/view for your date dimension, add another 7 new columns, WeekStartingMondayDay, WeekStartingTuesdayDay, etc., which will hold the DayOfWeekId for the day as it behaves under that week start. For example, WeekStartingMondayDay will always be 1 for Monday and 2 for Tuesday; WeekStartingTuesdayDay will always be 1 for Tuesday and 2 for Wednesday, etc.
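Just to illustrate, here is a rough T-SQL sketch of how a couple of these columns might be derived in the date view. The table and column names (dbo.tbDates, MyDate) are only examples, only the Monday and Tuesday variants are shown, and the week numbering keeps partial first weeks (the full-week variant with WeekStartingMondayYear needs a little more logic).
SELECT
    d.MyDate,
    -- Day-of-week keys (1-7), independent of SET DATEFIRST.
    -- Day 0 (1900-01-01) was a Monday, so DATEDIFF(dd, 0, MyDate) % 7 is 0 for Monday ... 6 for Sunday.
    DATEDIFF(dd, 0, d.MyDate) % 7 + 1        AS WeekStartingMondayDay,   -- Monday  = 1 ... Sunday = 7
    (DATEDIFF(dd, 0, d.MyDate) + 6) % 7 + 1  AS WeekStartingTuesdayDay,  -- Tuesday = 1 ... Monday = 7
    -- Week number: Monday-aligned weeks counted from the Monday on or before 1 January.
    DATEDIFF(dd,
             DATEADD(dd, -(DATEDIFF(dd, 0, DATEADD(yy, DATEDIFF(yy, 0, d.MyDate), 0)) % 7),
                          DATEADD(yy, DATEDIFF(yy, 0, d.MyDate), 0)),
             d.MyDate) / 7 + 1               AS WeekStartingMonday
FROM dbo.tbDates AS d;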
In your Date dimension, add the new 14 (or 21 if you include year) attributes. For each of the day attributes, keep the WeekStartingMondayDay number 1-7 as the key, but provide the name as the day-of-week name (Monday, Tuesday, etc.). E.g., the key might be 3 and the name might be 'Friday' for a WeekStartingWednesdayDay member. In SSAS, set the attribute's OrderBy property to Key. That way, when the attribute is browsed, SSAS will sequence days by the key, not the name. So, WeekStartingWednesdayDay will have Wednesday (key=1) as the first day, Thursday (key=2) as the second day, etc. Your week attributes (WeekStartingMonday, WeekStartingTuesday, etc.) will be OK with the number 1-53 as both the key and the name. Note, ensure that OrderBy is set to Key for the weeks also; otherwise, SSAS might sequence your weeks alphabetically by caption.
Create 7 hierarchies, one for each Week Start day, with Week as the first level and day as the lower level. You may also want to add in Year at the top level of the hierarchy.
Now, your users can choose one of 7 hierarchies for weeks, depending on which day they would like the week to start.
Users can also cross join one of these 7 hierarchies with Dates; they will then see a heading with the date and the day of week underneath, broken by weeks.
 
This solution will perform well in your cube. Since these attributes are simply attributes of existing members, the measure groups will not change in size with their addition. Also, these attributes will perform well as they are physical attributes, not calculations.

Wednesday, June 24, 2015

SQL Server Advanced Scan, aka Merry-go-round or Scan Piggybacking

I think SQL Server 2005 was the first SQL Server version with Advanced Scan. I have been excited by this feature, comforted by the thought that my databases are performing better, but also a little unsure of its efficacy. According to the documentation, when one query is running a table scan, subsequent queries that require a table scan will "piggyback" on the first scan, avoiding additional scan IO. Apparently, any number of queries can all piggyback on the same scan. Once the first one completes, the others carry on from the beginning, as they may have joined in after the first scan started. So there should never be more than one scan of any one table at a point in time.

What a great feature! However, there are no Profiler traces for Advanced Scan, so how do you know it is actually working? I've just run a performance test of Advanced Scan with 6 concurrent queries that all scan the same table (with slightly different predicates). I ran another test with one query running solo.

I was impressed with the results.
The table I used is about 10 times the size of the available memory, so SQL had no chance of caching the table. The solo scan ran as fast as the read-ahead engines could provide the data. The multiple read-ahead engines (from the solo scan) kept the SAN storage busy with an IO queue count constantly greater than 50 and reached a steady state of about 100MB/second. I ran this a couple of times to warm up the system. Note, this single query was running with MAXDOP=1. It would not run faster with MAXDOP=0, as the SAN storage was already working at maximum throughput with the workload of the multiple read-ahead engines.

I then ran 6 queries concurrently that all required a scan of the same table, but with slightly different predicates. These 6 scans all completed in a similar time to the first solo scan! That told me that Advanced Scan was working, but where could I see it? I could not find any evidence in Profiler (apart from the duration) that Advanced Scan was operating. In the queries I had SET STATISTICS IO ON, and the statistics provided the evidence that Advanced Scan was functioning. All 7 queries had similar logical reads, physical reads, CPU time, and elapsed time. However, the 6 concurrent queries had much smaller read-ahead reads; the total for the 6 concurrent queries was equivalent to that of the one solo scan.
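
For reference, the shape of the test was roughly as follows (a sketch only; the table and column names are made up). The same statement, with a different predicate value, was run in 6 sessions at once, and the read-ahead reads reported by STATISTICS IO were compared against the solo run.

-- Run this in 6 sessions concurrently, varying @Region in each session (illustrative names).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @Region int = 1;   -- a different value in each session

SELECT COUNT_BIG(*) AS RowsScanned, SUM(SalesValue) AS Sales
FROM dbo.BigSalesTable
WHERE RegionId = @Region           -- slightly different predicate per session
OPTION (MAXDOP 1);                 -- the test above was run with MAXDOP = 1

-- Compare the "read-ahead reads" figure in the Messages tab: with Advanced Scan,
-- the 6 concurrent queries together report roughly the same read-ahead reads as the single solo scan.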

It's unfortunate that Profiler does not have an Advanced Scan event (or a flag on the scan event). However, this test does demonstrate that the feature works and that it can be quantified.

Note: Advanced Scan is only available in Enterprise Edition. I initially tried to monitor this feature in a developer version of SQL Server, but it gave mixed results. Also note, the test above had Max Degree of Parallelism set to 1. I received similar results with MAXDOP set to 0.

Tuesday, June 16, 2015

Natural Keys are Healthy

It seems that I often attract suggestions from developers who have recently been reading Ralph Kimball. This week, it was a suggestion to add surrogate keys to all the dimension tables because this "would make the datamart more extensible and scalable"! I had to argue my preference to use synthetic (surrogate) keys only when there is a need. If the natural business key works and there is no need for SCD (slowly changing dimensions, or Type 2, as Kimball would say), I would keep the design simple and use the existing natural key. That key is already maintained by the source system and business users are typically familiar with it. In this case the main dimension table contains book titles and has a natural key of ISBN. I think that since the ISBN is an International Standard Book Number, and its attributes don't need SCD support, it would make a perfectly good primary key of the titles dimension table. Also, the fact tables would have ISBN as a foreign key - just like the source system. This requires much less ETL logic and makes reconciliation much easier. There is no need to translate surrogate keys when determining which facts are related to ISBN=x. Why would you want to replace ISBN with a synthetic number, which is meaningless to everyone? I struggle with applications that are more complex than they need to be. I think Einstein said something along the lines of "Everything Should Be Made as Simple as Possible, But Not Simpler".
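
As a purely hypothetical sketch of what this looks like in the schema (the table and column names here are made up for illustration), the natural key flows straight from the dimension to the fact table:

-- Hypothetical illustration: ISBN as the natural primary key of the titles dimension
-- and as the foreign key in the fact table, so the ETL does no surrogate key translation.
CREATE TABLE dbo.DimTitle
(
    ISBN       char(13)      NOT NULL PRIMARY KEY,   -- natural business key
    Title      nvarchar(200) NOT NULL,
    Publisher  nvarchar(100) NOT NULL
);

CREATE TABLE dbo.FactSales
(
    ISBN       char(13) NOT NULL REFERENCES dbo.DimTitle (ISBN),
    SaleDate   date     NOT NULL,
    Quantity   int      NOT NULL,
    SalesValue money    NOT NULL
);

Reconciling facts back to the source is then a straight join or filter on ISBN=x, with no lookup through a surrogate key.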

Some might argue that the fact table will be thinner with a numeric foreign key rather than the 13 byte ISBN. That is true; however, I think it's risky to remove the natural key (ISBN) from the fact table, as it makes reconciliation with ISBN difficult in the event of a bug in the synthetic key allocation. Also, the fact and dimension tables will probably want indexes on these keys, and with both natural and synthetic keys, there might be a need for twice the number of indexes. Overall, I would argue that natural keys offer higher performance.

I have seen datamarts made needlessly confusing by inappropriate use of surrogate keys. To illustrate, imagine the titles dimension constructed with a synthetic surrogate key and the ISBN no longer unique. This might be argued by a few to be superior as it would support historical integrity. (Hmm, as if historical integrity is necessarily superior.) Think about a slowly changing ISBN! Someone enters a new title as "Kate the Cat", when it should have been "Katie the Cat". After a few weeks, it gets noticed and is corrected. However, the SCD datamart is too clever and says that it was "Kate the Cat" for those first few weeks, so it must remain so, for historical integrity! From now on, users of this dimension must combine "Kate the Cat" with "Katie the Cat" to get sales information on this book. This can be frustrating for users, who know only one book called "Katie the Cat".

I have also seen datamarts where the dimension tables have blown out by more than 20 times (1 million customers turns into a 20 million row customer dimension) because all attributes are maintained with SCD logic. So each customer record had been updated an average of 20 times. Sometimes the changing attributes are irrelevant to the business from an SCD perspective, such as a field containing the number of Customer Service calls. In this case, every time the customer calls up, there will be a new record in the customer dimension, with just the call count incremented by 1. For what purpose?

I have also seen date dimensions with surrogate keys! This also puzzles me. Does the developer believe that attributes of a date will change and that reports should show those dates with historical integrity? To my simple mind, that just seems to be a quick way to make a datamart/DW more complex for no benefit. Unless, there is benefit in increasing future work loads as maintenance and reconciliation become bigger chores.

I would add that if you really do identify a need for SCD, I would encourage you to maintain SCD only on the identified attributes. Do not apply SCD logic to all attributes blindly.

Please don't see this blog as anti-Kimball. I truly believe that he has brought a lot of rigour and maturity to data warehousing. However, to follow his dimensional modelling blindly is not pragmatic and does not help organisations trying to build business intelligence.

Sunday, June 7, 2015

Fast CASE statements in MDX

Sometimes we need to create a CASE statement in an MDX calculation to split a measure value. There was a good example in this msdn forum. Here is a simple example:
CASE 
     WHEN Measures.[MyCalc]=0 THEN 0
     WHEN Measures.[MyCalc]<=5 THEN 5
     WHEN Measures.[MyCalc]<=10 THEN 10
     WHEN Measures.[MyCalc]<=15 THEN 15 
     WHEN Measures.[MyCalc]<=20 THEN 20 
     WHEN Measures.[MyCalc]<=25 THEN 25     
...
     ELSE 100
END

What I have found is that Measures.[MyCalc] is recalculated for every WHEN clause until one matches. So the CASE statement could be rewritten as below with much greater performance. If your calculations are passing through 20 WHEN clauses, this new expression could be up to 20 times faster! This is most noticeable when MyCalc is itself expensive, particularly when it is FE (formula engine) intensive.

CASE Round(Measures.[MyCalc]/5)*5
     WHEN 0 THEN 0
     WHEN 5 THEN 5
     WHEN 10 THEN 10
     WHEN 15 THEN 15 
     WHEN 20 THEN 20 
     WHEN 25 THEN 25     
...
     ELSE 100
END

Note: you cannot put < or > operators in the WHEN clauses if you use the second format above. However, with a little arithmetic, you might be able to bracket the calculation as I have done above using Round().
Another note: it might not be possible to express the CASE statement as I suggest above because there are several conditions. However, I encourage you to think laterally; you might be able to nest a few CASE statements on the various conditions, so there are fewer calculations per CASE statement, making them perform much better.

Saturday, June 6, 2015

Confusing = with IS in MDX calculations

This topic has been written about many times, perhaps even by me. However, it is a common pitfall, so I am writing about it again. See this msdn post for a recent example.

Essentially, the pitfall is when an author wants to check if the current member of a dimension attribute is a particular member. The incorrect MDX might look like this
IIF(DimCustomer.CustomerClass.Currentmember = DimCustomer.CustomerClass.CustomerClass.&[Gold], Measures.Price*0.9,Measures.Price)

However, the "=" operator is actually testing if the tuples on either side are equal in their Measure. Since there is no Measure specified, the measure will be the one on that slice (tuple), or the default measure if there isn't one in the tuple. For the two tuples to be equal is rather random. So the results will be somewhat unpredictable.

What should be used is the "IS" operator, which will test whether the members on either side are the same. That is: is the current CustomerClass Gold? So the IIF() should be rewritten as
IIF(DimCustomer.CustomerClass.Currentmember IS DimCustomer.CustomerClass.CustomerClass.&[Gold], Measures.Price*0.9,Measures.Price)

Note, if you are having problems following this logic, the IIF() with the IS operator is equivalent to the following statement.
IIF(DimCustomer.CustomerClass.Currentmember.properties("key") = "Gold", Measures.Price*0.9,Measures.Price)

Tuesday, May 19, 2015

Putting private data to public good

I was very pleased to read about IBM's collaboration with US cancer institutions, using the history of people's cancer treatments and their progress to help determine the most effective treatment for a newly diagnosed patient. This is a great example of putting private data to public good. I have been looking for similar opportunities in the health sector for many years. There is so much scope to use historical diagnoses, demographic data, treatments and outcomes to forecast the risk and efficacy of various treatment options. However, I have found a pretty solid wall declaring such data couldn't be used for "privacy" reasons. This is a real shame, for if I were diagnosed with an illness where the health specialist was deciding between various drug therapies and surgical options, I would like that specialist to make use of all the information available. Of course, very experienced health specialists will have a vast range of experience and have read reams of published studies. But I am sure that a data mining model over national (or even international) databases would help estimate the risks and efficacies of the various treatment options.

I find the general concern over "data privacy" intriguing. Whether it's your Google search terms, driving records, or telephone calls, everyone must accept that their personal data is held somewhere. Why are we so paranoid about our privacy? We have lived in communities for thousands of years, where we are known about in the community. We could all withdraw with curtains drawn and minimal social interaction, but we would then miss out on so many community benefits.

I hope that in the future, we can all open our blinds and see more "private" data put to public good.


Monday, January 19, 2015

Basket Analysis IV - Menu Trim

This is the fourth post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

Retail stores are always trying to balance the range of stocked items with inventory efficiency. A common task is to perform a "menu trim" where low profit contribution items are removed from the menu. A simplistic approach would be to remove the items with the lowest sales value or gross profit. However, this can result in trimming items that, while not profitable in themselves, are attracting customers and have high "associated" sales and profit contribution.
Consider the menu items above that are ordered by item sales. The Berry Smoothie is the item with the lowest sales value and gross profit. A simplistic view would be to say let's trim Berry Smoothie off the menu as it only represents $795.23 of sales. However, from our basket analysis query, we can see that the Berry Smoothie has a high Associated Net Sales and Associated Gross Profit. If we took Berry Smoothie off the menu, we might be risking $19,400 of associated sales. This is the total value of all transactions that have included Berry Smoothie. It might be that customers are coming into the store primarily for the Berry Smoothie and getting other items with the smoothie. If we remove Berry Smoothie from the menu these customers might not patronize our store.
Basket Analysis does not answer these questions directly, but it does give us information to make a better informed decision. In the case above, we might consider trimming the Vegeburger from the menu, as it has the lowest Associated Net Sales and lowest Associated Gross Profit. It appears that customers who come in to buy vegeburgers purchase that item alone. So if that item were trimmed from the menu, it would appear to put less sales value at risk.

Friday, December 19, 2014

Basket Analysis III: Promoting Sales Items Associated with Heavy Transactions

This is the third post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

One common retail strategy is to promote a relatively low value item that is associated with "heavy" baskets. That is, to identify sales items that tend to be in baskets that have many other profitable items. Sometimes these items are a little quirky and customers go to that particular retailer for the quirky item, and while they are there, they purchase many other items. The rationale is that if we can attract more customers to purchase the quirky item, which has little profit contribution, they will purchase many other, profitable, items. So the promotion on the quirky item can be discounted with low, or even negative, gross profit. Basket Analysis technology can help identify these items, and it can also accurately measure the performance of the campaign. 

Here is a graph showing some of the key performance indicators around a promotion of Green Smoothies. This particular promotion increased the sales of Green Smoothies by a factor of 5, but these increased sales did not pull through the hoped-for additional sales. See the green line near the bottom of the chart - it shows the sales value of Green Smoothies, and you can see a clear increase in their sales during the promotion period. Now look at the Average Txn value including Green Smoothies (blue bar) during the promotion period. It drops from about $20/transaction to about $16/transaction. Also, notice the purple line showing the Green Smoothie affected sales. See how it has increased, but only by the value of the increased Green Smoothie sales. This line shows the total value of all transactions that have at least one Green Smoothie. This suggests that promotion shoppers are coming in to purchase the Green Smoothies, but they are only purchasing Green Smoothies. It is valuable to get an early appreciation of the effectiveness of a campaign like this, so that it can be halted or modified early.

Note, traditional sales reporting tools could easily report on sales values of Green Smoothies, but it is only with Basket Analysis tools that we can get metrics like Green Smoothie affected sales and Average transaction value including Green Smoothies.

For the next Basket Analysis post see Basket Analysis IV.

Monday, December 15, 2014

Basket Analysis II: Would you like a drink with that?

This is the second post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

A common question a customer receives when ordering food is "Would you like a drink with that?" This is an example of on-selling, where the customer has come in with an order, and the salesperson, cashier, etc. looks for an opportunity to increase the size (and profit contribution) of the transaction, and potentially the satisfaction of the customer, by inviting the customer to add another item to the transaction.  The activity can be much more subtle, such as "We have some hot-from-the-oven blueberry muffins today - would you like one with your coffee?" The intention is the same, and the activity usually involves some staff training and encouragement. Naturally, some training sessions are better than others, and some supervisors, motivations, etc. are better than others at encouraging staff to offer these on-sell opportunities. Basket Analysis is a great tool to help monitor and analyse the effectiveness of staff, training, and promotional activities. For example, a program could be developed to encourage staff to offer oven-hot blueberry muffins. Without Basket Analysis tools, a sales report might display the ratio of blueberry muffins per 1000 transactions to compare various staff, training programs etc. However, this report will be inappropriately skewed by a customer purchasing a dozen blueberry muffins in one transaction. Basket Analysis will provide metrics such as the % of hot drink transactions that contain blueberry muffins (see the query sketch below).
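
A minimal T-SQL sketch of that attach rate metric is below. The table and column names (dbo.SalesItem, TransactionId, ProductGroup, ProductName, CashierName, FiscalWeek) are assumptions for illustration; the idea is simply to count distinct hot drink transactions and the subset of those that also contain a muffin.

-- Attach rate: % of hot drink transactions that also contain a blueberry muffin,
-- by cashier and week (illustrative schema; names are assumptions).
SELECT
    s.CashierName,
    s.FiscalWeek,
    COUNT(DISTINCT s.TransactionId) AS HotDrinkTxns,
    COUNT(DISTINCT m.TransactionId) AS HotDrinkTxnsWithMuffin,
    100.0 * COUNT(DISTINCT m.TransactionId)
          / COUNT(DISTINCT s.TransactionId) AS MuffinAttachRatePct
FROM dbo.SalesItem AS s
LEFT JOIN dbo.SalesItem AS m
       ON m.TransactionId = s.TransactionId
      AND m.ProductName   = 'Blueberry Muffin'
WHERE s.ProductGroup = 'Hot Drinks'
GROUP BY s.CashierName, s.FiscalWeek
ORDER BY s.CashierName, s.FiscalWeek;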

The charts below report on the effectiveness of the training campaign.  The top graph shows that all 6 cashiers are handling similar numbers of transactions with hot drinks.  The training campaign was conducted during week 29. Notice how most cashiers increased their attach rate of muffins by about 30%.  Elizabeth did not change her attach rate (maybe she missed the training) and Barry's attach rate increased, but not by as much as the other cashiers'.

For the next Basket Analysis post see Basket Analysis III.

Sunday, December 14, 2014

High Performing Big Table Loads in SSIS

If you are trying to load a very large SQL Server table with fast load, you might be thwarted by not having the rows per batch/commit that you think you have.

Here is a rather vanilla fast table load destination using an SSIS data flow task. Notice how I have Rows per batch=200,000 and Maximum insert commit size=200,000. You might expect that the fast load will insert about 200,000 rows per batch and transaction. You would be wrong, as the batch size is also gated by the buffer size of the data flow task. To determine how many rows are being inserted per batch/transaction, simply use SQL Profiler and trace the INSERT BULK commands. They appear as SQL:BatchCompleted events. To see the number of rows, look at the RowCounts column in the trace. It will probably be a little less than 10,000.
Firstly, why is 10,000 rows per commit slow? If you have no indexes on your large table, the chances are that the loading is very fast. However, if there are non-clustered indexes, the loading is slowed down by sorting the batch records into each index sequence and applying them to the indexes. Applying the keys to the index could touch up to 10,000 pages. Then the updates are committed, requiring all updates to be written out to disk. Increasing the commit size can make updating the indexes more efficient. Note, the performance does not get forever faster as the commit size increases. There is a memory cost with large commit sizes, and the cost of sorting larger groups of records increases at a greater than linear rate. To determine the best size for your task, run and monitor several sizes and plot their speeds. Choose the buffer size (batch/commit size) that maximizes load speed. Note, the optimal size for a small table will not be the same once the table becomes very large, with indexes larger than db cache memory.

Also note, if you are loading a table from scratch, it can be more efficient to load the table with just a clustered index, then create the indexes in parallel after loading. Of course, partitioned tables will give you more flexibility in tuning your index build and loading.
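
For example, one common pattern (a sketch only, with made-up index and table names) is to disable the non-clustered indexes, run the load against the clustered index alone, and rebuild afterwards:

-- Illustrative pattern: load with only the clustered index in place,
-- then rebuild the non-clustered indexes afterwards (object names are assumptions).
ALTER INDEX IX_BigTable_CustomerId ON dbo.BigTable DISABLE;
ALTER INDEX IX_BigTable_ProductId  ON dbo.BigTable DISABLE;

-- ... run the SSIS data flow / fast load here ...

ALTER INDEX IX_BigTable_CustomerId ON dbo.BigTable REBUILD WITH (MAXDOP = 0);
ALTER INDEX IX_BigTable_ProductId  ON dbo.BigTable REBUILD WITH (MAXDOP = 0);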

Anyway, to increase the commit size, you will need to increase the Data Flow buffer size. See below: the DefaultBufferMaxRows is 10,000 and DefaultBufferSize is 10,485,760 bytes. Increase these configuration settings to get larger commits.


Monday, December 8, 2014

Basket Analysis I: What is Basket Analysis

This is planned to be the first blog in a series on Basket Analysis.  In this blog, I will try and answer the question, What is Basket Analysis?

Basket Analysis technology offers an exciting and innovative way of looking at retail data. Simple business intelligence will provide retailers with all sorts of analysis by individual products, including sales by cashiers, customers, profit contribution, waste, discounts, labour, time of day, weather, etc. All this analysis is essential for most retailers in today's market. However, more and more retailers are using the behaviour of shoppers to look for opportunities to improve their operation.

Basket Analysis is about analysing which sales items are bought together (in the one transaction, or in different transactions for the same customer).   A Basket Analysis solution will enable you to query and report on which products are bought together, and on the profit contribution of all the "baskets" (transactions) that include, or exclude, a particular combination of products.

Basket Analysis is not a set of metrics or KPIs from which you can run your business. It is a data solution that allows you to ask questions about the product mix of transactions. Perhaps the simplest question is "Tell me about the total transaction value by individual product item".   For every sales item, show me the following (there is a query sketch after this list):
  • Item Average Price
  • Item Sales Quantity
  • Item Sales Value
  • Item Cost of Goods
  • Item Gross Profit %
  • Item Average Discount
  • Transaction Sales (total value of transactions including this item)
  • Transaction Count (number of transactions including this item)
  • Transaction Average Value
  • Transaction Average Quantity (average items/transactions)
  • Transaction Average Discount
  • Transaction Cost of Goods
  • Transaction Gross Profit
  • Transaction Gross Profit Percent
  • Transaction Value Multiple (total transaction sales/item sales)
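A minimal T-SQL sketch of how a few of these per-item metrics might be assembled is below. It assumes a simple transaction line table (dbo.SalesLine with TransactionId, ItemName, SalesValue, and CostOfGoods - all made-up names) and shows only a handful of the measures listed above.

-- Sketch: per-item basket metrics over an assumed transaction line table.
WITH ItemPerTxn AS
(   -- collapse to one row per item per transaction
    SELECT TransactionId, ItemName, SUM(SalesValue) AS ItemSales
    FROM dbo.SalesLine
    GROUP BY TransactionId, ItemName
),
TxnTotals AS
(   -- one row per transaction with its total value and gross profit
    SELECT TransactionId,
           SUM(SalesValue)               AS TxnSales,
           SUM(SalesValue - CostOfGoods) AS TxnGrossProfit
    FROM dbo.SalesLine
    GROUP BY TransactionId
)
SELECT i.ItemName,
       SUM(i.ItemSales)      AS ItemSalesValue,
       COUNT(*)              AS TransactionCount,       -- transactions including this item
       SUM(t.TxnSales)       AS TransactionSales,       -- total value of those transactions
       SUM(t.TxnGrossProfit) AS TransactionGrossProfit,
       SUM(t.TxnSales) / NULLIF(SUM(i.ItemSales), 0) AS TransactionValueMultiple
FROM ItemPerTxn AS i
JOIN TxnTotals  AS t ON t.TransactionId = i.TransactionId
GROUP BY i.ItemName
ORDER BY ItemSalesValue DESC;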
Here is an illustration report of that very simple question that might be asked of a Basket Analysis solution.  It is simple, because the analysis is by individual items.  A more sophisticated query would be on particular combinations (existence, or non-existence) of items.  That is for subsequent blogs in this series.


From this simple report there are already many interesting insights available.  For example
  • Items that have a low sales item value but a high transaction value (Transaction Value Multiple) are items that, whilst they contribute very little profit themselves, may be bringing in customers who purchase more profitable items.
  • Items with a very low Transaction Value Multiple, close to 1, are bought alone and probably do not induce customers to buy other products.
  • This simple report for one sales item, such as Cheese Cake, by waiter/cashier can give you insights as to which staff/outlets are good at on-selling.
  • Sales items with the highest Transaction Gross Profit are ones that are likely to be important to your business, even though they may not rank highly in individual item sales.
For the next Basket Analysis post see Basket Analysis II.


Thursday, November 20, 2014

Yikes - A read query is blocking everything!

This is something I saw in an OLTP database: an application (.NET) read query scanning a multi-billion row table while holding a full table S lock. The query thought it was using the default isolation level of read committed. Almost every other query in the system was blocked in a chain behind this read query: all update queries on the table (arguably the most important table in this OLTP application) were waiting on the S lock, and they in turn prevented any other read queries from jumping the queue, so those read queries waited for locks too.

By default, SQL Server read queries will use a locking isolation level of READ COMMITTED. Most developers just assume this and, understandably, do not bother to set the isolation level. With READ COMMITTED, a read query will only lock the current record while scanning many rows. It has an "intent" lock on the table, which is compatible with all other "intent" locks on the table. However, if you are using connection pooling, which most applications do nowadays, it is possible that your query has picked up an isolation level of REPEATABLE READ from a recycled connection. This is quite serious. Now your query won't release locks as it scans through rows, and once it has read a few thousand rows, it will stop taking row/page locks and lock the entire table.  This is called Lock Escalation: you have taken so many locks on one object that SQL Server says, "that's enough, you may as well have one lock on the entire table". Now it gets interesting. Any INSERT, UPDATE, or DELETE query will have to wait on your query, since they are incompatible with your S (share) lock on the table. What happens next is even more interesting (in a nefarious kind of way). All subsequent read queries (except NOLOCK read queries) will wait for a lock too.  They are compatible with the full read lock, but they are incompatible with the IX locks of the insert, update, and delete queries, so they must wait in the queue. I call this the "no queue jumping" rule.

How exactly does your query get a REPEATABLE READ isolation level when it didn't ask for it? The answer is connection pooling. An application program can change the isolation level of the connection and leave the changed isolation on the released connection. The next thread to pick up the connection will have the isolation level left behind. Many people have raised this issue before and there has been a Connect issue supported by many developers.
http://connect.microsoft.com/SQLServer/feedbackdetail/view/243527/sp-reset-connection-doesnt-reset-isolation-level
Unfortunately, I feel, Microsoft's response has been to close the "Bug" as "By Design"!

So what can we do to avoid queries escalating to full table locks when they assume isolation of READ COMMITTED?
  1. The cleanest solution would be for the connection pool to pass on used connections with their default settings. This is unlikely to happen in the near future as indicated above by Microsoft. 
  2. You could disable connection pooling. This is quite heavy-handed and you will pay for it with a slower application, as new connections generally take hundreds of milliseconds to create.
  3. In your application that uses connection pooling, ensure that all connections are reset to the default isolation level before release. Note, sp_reset_connection does not do this. Also note, there might be times when a piece of code does not follow this standard. Unfortunately, the non-standard code won't be affected.  It will be the innocent code that picks up the connection that hits the locking issue.  
  4. In your application that uses connection pooling, ensure that all new connections are set to the desired isolation level. Ie, don't rely on a new connection having the default settings. This is probably a good practice, as all code that follows this standard will be OK.
  5. Use WITH (NOLOCK) table hint on your read queries. This will certainly avoid the lock escalation. However, there can be interesting data integrity issues in reading with no locks, such as reading the same row twice. I think there is a place for NOLOCK, but I wouldn't advise the use of NOLOCK throughout your application. The problem is the use of REPEATABLE READ, not READ COMMITTED.
  6. Use WITH (READCOMMITTED) table hint on your read queries. This will override REPEATABLE READ isolation level on the connection for this table. So the problem above is avoided, but it is a bit of a task changing all the queries to use this table hint. Note, query hints have a bad rap, which is understandable when they override the optimizer's choice of access path. I see no problem with query hints when they are requesting a specific isolation level.
  7. Use READ COMMITTED SNAPSHOT ISOLATION on your database. This is not a solution, as the read query is on a connection that has REPEATABLE READ isolation. As the name suggests, READ COMMITTED SNAPSHOT ISOLATION only works on connections that are using READ COMMITTED.
  8. Use SNAPSHOT ISOLATION (ALLOW_SNAPSHOT_ISOLATION) on your database. This is not a solution either, as it does not override the REPEATABLE READ isolation level of the read query.
  9. Wait for an upgrade to SQL Server 2014.  It looks like the isolation level is reset on pooled connections.  However, you should check, as this information has simply been posted in an msdn forum, albeit by a SQL team member. 
How can I determine if REPEATABLE READ is being used by my application?
  • A SQL Server Profiler trace on ExistingConnection events will provide the isolation level of all active connections. Just look for set transaction isolation level and see what follows. If any of your current connections have REPEATABLE READ, you should be concerned. ExistingConnection will only display currently active connections. To see if the isolation level is being changed, you will need to trace the SQL:BatchCompleted, SP:StmtCompleted, and Prepare SQL events, looking for the text '%set%transaction%isolation%level%repeatable%read%'.
  • Another useful Profiler trace is to trace Lock:Escalation. I believe that any table lock escalation in your OLTP environment should be of some concern to you, particularly if you have locking or deadlock issues. However, if you have too many to look at, you could filter in/out by Type=5 (Object), objectId (table) and/or Mode=3 (S-share).
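In addition to Profiler, a quick check of the isolation level of current sessions can be made from the DMVs. Here is a sketch; the numeric codes are those documented for sys.dm_exec_sessions.transaction_isolation_level.

-- Sessions whose current isolation level is REPEATABLE READ or higher.
-- transaction_isolation_level: 0=Unspecified, 1=ReadUncommitted, 2=ReadCommitted,
--                              3=RepeatableRead, 4=Serializable, 5=Snapshot
SELECT s.session_id,
       s.login_name,
       s.program_name,
       s.transaction_isolation_level
FROM sys.dm_exec_sessions AS s
WHERE s.is_user_process = 1
  AND s.transaction_isolation_level IN (3, 4);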
How lucky are we? There are two ways of viewing the locking incident.

  1. Unfortunate that it happened and get a quick fix asap for this particular query.
  2. Fortunate that this relatively small incident occurred, which enabled us to uncover something quite serious (full table locks in an OLTP environment). The issue has given us an opportunity to discover the dynamics of connection pooling and has helped us find an application-wide fix to prevent its future occurrence. Also, if there is a future locking incident, we have more knowledge to diagnose that too. I see this as analogous to someone having high blood pressure and going to the doctor for a fix.  The high blood pressure could be seen as annoying and simply needing some pills to lower it, or it could be an opportunity to determine the cause of the high blood pressure (be it diet, stress, etc.) and rectify the underlying cause.

Wednesday, November 19, 2014

Relative Dates - Today, Yesterday, LastWeek etc.

A useful little piece of functionality I often add to OLAP cubes is what I call Relative Dates.  In the Date dimension I add attributes CurrentDay, CurrentWeek, CurrentPeriod, CurrentYear etc.  These attributes are simply expressions within the Dates dimension view, which gets processed early every morning. For example, CurrentDay could be something like
Case
when d.MyDate=convert(date,getdate()) then 'Current Day'
when d.MyDate=convert(date,dateadd(dd,-364,getdate())) then 'Current Day LY'
when d.MyDate=convert(date,dateadd(dd,-1,getdate())) then 'Yesterday'
when d.MyDate=convert(date,dateadd(dd,-365,getdate())) then 'Yesterday LY'
else 'Other days'
end CurrentDay,

Note, Current Day Last Year is -364 days, which will ensure that it will be the same day of week. It's the same as saying -52 weeks.

Note, you only need to process the Date dimension; the large fact tables do not need to be reprocessed, as their Date dimension keys are not changing. It's just the (flexible) attributes of the Date dimension that are changing daily.

Note, it is not practical to create a hierarchy of these attributes as they don't fall within one another. For example, Today and Yesterday might be in the same fiscal week, or they might fall in consecutive weeks.

The logic for CurrentWeek etc is a bit more complicated as your case statement will be checking if the FiscalWeek of getdate() is the same as the FiscalWeek of MyDate. Eg.
case
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,getdate())) then 'Current Week'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-52,getdate()))) then 'Current Week LY'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-1,getdate()))) then 'Last Week'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-53,getdate()))) then 'Last Week LY'
else 'Other weeks'
end CurrentWeek,

You can do the same sort of logic for Periods, Quarters, and Years. Note, this functionality does not replace Time Intelligence. If anything it adds to it as Time Intelligence will go across the Relative Dates attributes, or, more likely, you can use the Relative Dates attributes to filter dates and use a regular date hierarchy on rows/columns with Time Intelligence. The advantage of the CurrentDay, CurrentWeek, CurrentPeriod, and CurrentYear attributes is that reports can be relative in time. Ie, the filter can be a multi-select of this period and last period, showing weeks on columns, which will have 5-9 weeks.  So the report will always display the last 5-9 weeks, without the user having to select the current period etc. These filters will work in Reporting Services, Excel, PerformancePoint, or any other cube browser/reporting tool.

Wednesday, September 3, 2014

SQL Functions and Plan Caching

I feel that I have become typecast as a Business Intelligence specialist.  However, I often employ my SQL skills and I still enjoy the challenge of solving pure SQL performance issues.  Here's an issue that presented itself as a table-valued function in an OLTP environment, which was taking a lot of compilation effort (6.7 seconds) on every execution, while the execution itself took only 0.15 seconds.  It was executing about once per second at peak.

Now, I am a great fan of functions, but I know that they are not compiled like stored procedures, so I am wary of using them for complex SQL queries.  This query was very complex.  I've got the function's code below, but it is based on a much more complex view, which joins 5 tables.  There is a picture of the function execution above.  See how the execution time is only 125ms, while the parse and compile time is 6638ms!  The obvious solution is to convert this function to a stored procedure, which will enable SQL to store the plan.

However, this would involve changes to .NET code, which would involve a lot of testing and change control.  I found another solution, which enabled us to fix the function without any change to .NET code, permitting a quick fix to production.  The fix was to alter the function into a multi-statement table-valued function!  This way SQL Server will treat the 'multistatement table-valued function' like a stored procedure and enable its plan to be cached.  You can see from the execution results below that there is no parse and compile time, and the execution time is the same 125ms.
How does it do this?  It appears that SQL Server will treat the multistatement table-valued function as a stored procedure.  We can see this from the plan cache query below.

Note, SQL Server is able to cache Adhoc queries, so that your execution of a function with a particular parameter can be cached.  Don't let this cloud your analysis, unless your OLTP environment is going to execute the function with exactly the same Adhoc text.  You will see these 'Adhoc' cached plans above with objtype='Adhoc'.

Note, there is documentation that explains this behaviour, although you might have missed it.  See http://msdn.microsoft.com/en-us/library/cc293623.aspx and read the text near 'multistatement table-valued functions'.

What does this tell us?  Firstly, it reminds us that compile time is not always trivial.  Secondly, it reminds us that we should treat Functions as different to Procedures, and only use Functions where they are appropriate.  I find them most useful where they are doing something like string manipulation and optimisation is not appropriate.

Old function (with 6.8 second compile time every execution)
create function MyFunction 
(@Parameter1 int) 
RETURNS TABLE AS
RETURN (
SELECT t2.*,t1.MyDate
from (select max(t4.TheDate) MyDate, 
  t3.BranchNo,
  t3.TranCode, 
  t3.Cash1, 
  t3.Cash2,  
  t3.Cash3, 
  t3.Cash4
      from vBranchCount as t3 
      left outer join  (SELECT *
                        FROM vBranchCount
                        WHERE vBranchCount.Status = 'Applied') AS t4 
ON t3.TranCode = t4.TranCode 
                AND t3.BranchNo = t4.BranchNo 
                AND t3.Cash1 = t4.Cash1 
                AND t3.Cash2 = t4.Cash2 
                AND t3.Cash3 = t4.Cash3 
                AND t3.Cash4 = t4.Cash4 
                AND t4.TheDate < t3.TheDate
      where t3.StockCountPlaceId = @Parameter1
      group by t3.BranchNo,t3.TranCode,t3.Cash1,t3.Cash2,t3.Cash3, t3.Cash4) as t1
inner join vBranchCount as t2 
  on t1.BranchNo = t2.BranchNo
 and t1.TranCode = t2.TranCode
 and t1.MyDate = t2.CashDate
 and t1.Cash1 = t2.Cash1
 and t1.Cash2 = t2.Cash2
 and t1.Cash3 = t2.Cash3
 and t1.Cash4 = t2.Cash4  )

New Function with 6.8 second compile time only on first execution.
create function MyNewFunction 
(@Parameter1 int) 
RETURNS @t TABLE
(  TeamID bigint,
           Rain bit,
           WitchHat varchar(50),
           JoyStormID bigint,
           Happyness int,
           TeamName varchar(250),
           CashIn money,
           CashOut float,
           TheDate datetime,
           JoyClerkID bigint,
           JoyClerkName varchar(150),
           TranCode bigint,
           Bills float,
           CashDate datetime,
           BranchNo bigint,
           Temperature varchar(50),
           StockCountPlaceId bigint,
           Applicability int,
           Readyness int,
           SiteID bigint,
           BillsToChange int,
           Cash1 varchar(50),
           Cash2 varchar(50),
           Cash3 varchar(50),
           Cash4 varchar(50),
           JoyUnit1 int,
           JoyUnit2 int,
           JoyUnit3 int,
           JoyUnit4 int,
           TranSystemId bigint,
           Handle1 varchar(35),
           Handle2 varchar(35),
           Consumer varchar(50),
           JoyConsumerId bigint,
           Rainbow varchar(50),
           StormCode varchar(50),
           ChangeDate datetime)
 AS
BEGIN  
INSERT INTO @t
SELECT t2.*,t1.MyDate
from (select max(t4.TheDate) MyDate, 
  t3.BranchNo,
  t3.TranCode, 
  t3.Cash1,
  t3.Cash2,
  t3.Cash3, 
  t3.Cash4
 from vBranchCount as t3 
 left outer join  (SELECT *
FROM vBranchCount
WHERE vBranchCount.Status = 'Applied') AS t4 
ON t3.TranCode = t4.TranCode 
AND t3.BranchNo = t4.BranchNo 
AND t3.Cash1 = t4.Cash1 
AND t3.Cash2 = t4.Cash2 
AND t3.Cash3 = t4.Cash3 
AND t3.Cash4 = t4.Cash4 
AND t4.TheDate < t3.TheDate
 where t3.StockCountPlaceId = @Parameter1
 group by t3.BranchNo,t3.TranCode,t3.Cash1,t3.Cash2,t3.Cash3, t3.Cash4) as t1
inner join vBranchCount as t2 
 on t1.BranchNo = t2.BranchNo
 and t1.TranCode = t2.TranCode
 and t1.MyDate = t2.CashDate
 and t1.Cash1 = t2.Cash1
 and t1.Cash2 = t2.Cash2
 and t1.Cash3 = t2.Cash3
 and t1.Cash4 = t2.Cash4   

RETURN
END

Query plan cache

SELECT usecounts, cacheobjtype, objtype, text 
FROM sys.dm_exec_cached_plans 
CROSS APPLY sys.dm_exec_sql_text(plan_handle) 
WHERE (text like '%MyFunction%' or text like '%MyNewFunction%')
and objtype <> 'Adhoc'
and cacheobjtype='Compiled Plan'
ORDER BY usecounts DESC;

Wednesday, August 27, 2014

Database IO Throughput

I've written on this before, but I feel I need to say it again in a different way, as so many people fail to grasp the significance of IO throughput capacity.  If you see messages in the SQL log along the lines of "SQL Server has encountered n occurrence(s) of I/O requests taking longer than 15 seconds to complete on file ...", then you have a very, very serious IO throughput issue. I didn't make these logs up.  They are real logs!
The best analogy I can draw is if you can imagine your company has 1000 staff and you have a building of 20 floors with 4 lifts that can adequately house all staff.  Your 1000 staff tend to drift in between 9 and 10AM and the 4 lifts can move the 1000 staff to their desired floors in the hour.  Infrequently, staff will queue for a lift as the first lift will be full, but it's rarely longer than one lift ride for the wait.  During the day, the lift demands are lower and there are no queues.

Now, over time, your company grows to 2500 staff, and you've added another 30 floors to your building and the lifts are extended to cater for all 50 floors.  The CEO is relieved that the building capacity has been increased to house the new staff and that the lifts work across all 50 floors.  However, no one has thought about the increased throughput required of the lifts.  Between 9 and 10AM, 2500 staff now drift into the office and want a lift to their floors; however, the 4 lifts are only capable of moving about 1200 staff/hour.  The queues for the lifts build up and are not fully drained until after 11AM.  The company is now in trouble with disgruntled staff and finds it needs to ask staff to come in earlier or later.  Also, there is the hidden issue of the fire exit stairways also having throughput limits.  With 1000 staff, the building could be evacuated in 15 minutes; now with 2500 staff (and the same stairways) it will take over 40 minutes.

This is analogous to a SQL Server database growing over time, with the SAN (playing the role of the lifts) extended to hold more TeraBytes, but its throughput capacity (measured in MegaBytes/second of reading and writing) not increased.  The IO queues start to build up and IO times degrade.  The SQL log messages above are an example where hundreds of IOs every minute are taking longer than 15 seconds to complete!  This is an unbelievably dire situation that can cause SQL Server to shut down. 

I am not advising that you monitor your SQL Server log for long-running IOs, although if you see them, you should act immediately.  I advise you to monitor IO throughput, average latency, and queue length continually, using Perfmon or a related tool.  You can detect a throughput capacity limit when MB/second tends to flatline while IO latency and queue lengths are waxing and waning.  Unfortunately, Perfmon is not a complete tool, but it does have all the base information, and there are many third party tools, or you can do what I do and create your own by having Perfmon send records to a common SQL Server database and cube.  You only need a few counters, recorded every 60 seconds or so.  If the Perfmon database is on another server, there will be very little load on your production servers, essentially no IO.  Your Perfmon database could hold performance data for all your production servers, so that you have all the performance information you need in one place.  Also, when you have a performance issue, you have historical information on this issue and trends earlier in time.  Companies that don't continually record these metrics are flying somewhat blind, or, at least, flying with blinkers on.  Here's an earlier blog on creating a Perfmon cube, although, unfortunately, the online demonstration is no longer available.  
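
If you would rather start from inside SQL Server than from Perfmon, a rough per-file view of cumulative latency and throughput is available from sys.dm_io_virtual_file_stats. Here is a sketch; the numbers are cumulative since the instance started, so sample them periodically and difference the values to get rates.

-- Cumulative IO statistics per database file since SQL Server last started.
-- Sample periodically (e.g. every 60 seconds) and difference the values
-- to derive MB/second and average latency over each interval.
SELECT DB_NAME(vfs.database_id) AS DatabaseName,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       (vfs.num_of_bytes_read + vfs.num_of_bytes_written) / 1048576.0 AS TotalMB,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS AvgReadLatencyMs,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS AvgWriteLatencyMs
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY TotalMB DESC;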

This posting is not intended to provide answers to the IO throughput issue.  The solutions can be many and varied.  The solution will probably involve speeding up the storage system, but it might be to make the database more IO efficient.  The most important point I can make is that you should be monitoring and be able to recognise this issue.  Also, don't let a storage person tell you that you have hit the maximum possible throughput.  There are always ways to increase the throughput performance of your storage, just as there are always ways to improve your database performance.



Monday, June 30, 2014

Signs your Data Warehouse is Heading for the Boneyard

I have come across too many data warehouse projects that show obvious signs of failing while the project sponsors are in denial.  Here are a few signs that you might look out for and take seriously.

  1. You've spent one million dollars and haven't got one million dollars of value. Data warehouse (DW) projects are never really finished, as there are continual changes in business practices, business software, competition, business strategy, regulation, etc. DWs need to be flexible so that they can continually be updated/enhanced to adapt to these changes in the business. If you've spent one million dollars and not got value from that, you are probably developing something that is inflexible. Typically, I would encourage even very large companies not to spend that much money without showing benefit. The vision for a DW might be very large, but start with the trickiest, largest, and/or most crucial data. The DW should manage that data and provide value to the business before more work is commissioned.
  2. Your project is 3 months overdue and the development manager is telling you that more resources are required. There are probably fundamental design issues that will continue to cause issues. Such a long development cycle will probably lead to an inflexible DW.
  3. Your user community do not trust the output. The DW is producing reports, dashboards, alerts, etc., but the user community do not trust the data and aren't using it to run the business.
  4. The same question can be asked of different parts of the warehouse with different answers. One transaction detail should, ideally, be stored in only one place in the DW. If it is stored in multiple places, the design is probably not optimal, and if they are providing different answers, there is definitely a problem.
  5. It's almost impossible to reconcile the data warehouse with your OLTP systems. The ETL (extract, transform and load) processes are so complex and involve so much custom code that no one can reconcile DW facts with OLTP systems. The ETL processes are probably too complex.
  6. Your developers are telling you that they are spending a significant amount of time developing functionality that will make the data warehouse more flexible. This is all too common in my view. Developers like to develop, and they can create large generic applications to schedule jobs, "clean data", manage file transfers etc. Most generic DW software is already available in Microsoft's SQL Server and related technologies. The role of the developers is to use this software to create a bespoke application that is as simple as possible. Try and stick to existing vendor software. They will maintain it and provide future releases. You will always have the option to move platforms in the future having spent much less money.
  7. The data warehouse requires a full-time person to manage the jobs and exceptions. DW schedules should be reliable enough to manage themselves, simply providing alerts for late data etc. Intervention should only be required where there is a defect in the DW and a diagnosis and change is required. Even when there is a catastrophic failure, the DW should start up and simply continue from where it was.
  8. Some of the development team have left before value is delivered. DW projects should be delivered in small releases that provide value. If the release cycle is so long that significant team members are leaving, there is probably an underlying design issue.
  9. There are performance issues that are so severe that it is looking difficult to process one day's data in a day. This is a bright red flag.  The DW will only grow, with more feed systems and more data. Such a large performance issue should be seen as a huge question mark over the DW.
  10. The development team are telling you that they need to spend a lot of time up front to develop software that's going to make the DW more flexible! It's almost a certainty that their software will make the DW more inflexible. Think of it this way: if it takes 50 dev days to create a DW, it will take a proportion of those 50 dev days to make a change.  If the first deliverable takes 1000 dev days, enhancements will tend to be a proportion of 1000 days.


Monday, January 13, 2014

Determining a wayward query plan while it's executing

Occasionally, I get a production performance issue with a long-running SQL query that is consuming lots of system resources and running for much longer than usual.

You can see the long running query simply by using sp_who2, or some variant, that displays active SQL queries.  So you now have the SPID of the running query that is taking a long time.  I would typically then want to get the SQL command.

You can easily get the SQL for this query by using DBCC INPUTBUFFER(SPID).  I might then use the SQL text to get a show plan to see how it should execute.  You can quite easily do this by cutting and pasting the SQL into a new window and requesting a show plan.  However, sometimes the query plan that is running is not the same as the query plan that SQL Server tells you it would use now.  This can be for a variety of reasons, including new statistics, parameter sniffing, and different connection properties.

So what you really want to get is the plan that is currently executing.  You can obtain this from the sys.dm_exec_query_plan dynamic management function.  First of all, you will need the plan handle, which you can get from sys.dm_exec_requests using the SPID as the session_id.

Use the plan_handle to get the plan from sys.dm_exec_query_plan.  The query plan will be returned as an XML document.  Simply click on the XML in the query results window and SSMS will open the query plan in a new window.
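
Put together, the lookup is a short query. Here is a sketch (substitute the SPID of your long-running query for the illustrative session_id value):

-- Get the plan that is actually executing for a given session (SPID).
DECLARE @spid int = 57;   -- the session_id from sp_who2 (illustrative value)

SELECT r.session_id,
       r.status,
       r.start_time,
       qp.query_plan          -- click the XML result to open the graphical plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id = @spid;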

Now you have the query plan that is actually executing and you can compare that to the query plan you expected SQL to use.  From here you can tune the query to use a more appropriate query plan.

Note, I would encourage you to use all other options before using index or join hints to get the query plan you desire.  This is because once you override the optimiser with your index or join hint, it will always use that hint even when your table structures or data changes so much that it is obviously silly.  Optimiser hints such as FASTFIRSTROW, redundant predicates etc are OK with me, as they don't override the optimiser, they just tell the optimiser that you want a plan that will get the first row fast.

Saturday, August 10, 2013

Sargable Predicates are usually best practice

Every SQL query writer should know that sargable predicates are generally much faster than non-sargable predicates.  Essentially a sargable predicate can be used by the relational engine to navigate an index.  For example, the predicate Where IpCountryID=@i 

is sargable as the relational engine can use the variable @i to navigate an index on IpCountryID.  Similarly
Where IpCountryID=@i-1

is also sargable as the relational engine can add 1 to @i and use that to navigate an index on IpCountryID.  On the other hand
Where IpCountryID+1=@i
is not sargable as the relational engine cannot navigate an index while there is an arithmetic expression on IPCountryID.  These equivalent sargable and non-sargable predicates and their plans can be seen above.  The sargable predicate query uses an index seek to get the particular record, while the non-sargable predicate uses the same index but it scans the entire index checking if IPCountryID+1=@i.  An example of how the sargability of predicates is important is where you are joining a table to itself where the key has an arithmetic function.  For example, getting the previous record as in the following example
SELECT t1.*,t2.value
FROM MyTable t1
LEFT JOIN MyTable t2
ON t2.id=t1.id-1
Notice how the predicate t2.id=t1.id-1 is sargable on the left side but not the right.  This means that the relational engine can navigate to t2 using an index.  If the predicate was t2.id+1=t1.id (logically equivalent), the relational engine could not navigate to t2 using an index.  

I found an interesting situation while tuning a query on a couple of nested views.  It's not easy to reproduce, and to reproduce it I need to include other joins and logic, which will make it hard for you to comprehend.  The essence of the first view was to restrict the records to the last rolling 53 weeks.  It had the following predicate
WHERE TransactionDate >= DATEADD(WW,-53,GETDATE())
This predicate is sargable, which should be a good thing.  However, there are several views on top of this base view that are asking for month date ranges. So they have predicates with hard coded date ranges (on top of the first view) ie
WHERE TransactionDate between '20160101' and '20160131'
This predicate is also sargable, which should also be a good thing.  However, the issue I uncovered was that the optimiser now had a choice of TransactionDate>=DATEADD(WW,-53,GETDATE()) and TransactionDate>='20160101' as the lower bound to the index seek.  Sometimes it was choosing TransactionDate>=DATEADD(WW,-53,GETDATE()) as the lower bound, which is much slower than TransactionDate>='20160101', since 20160101 is a higher date (and there are no transactions for 2016 yet).  The solution to these poor performing queries was to change the first view to make the sargable predicate in the base query non-sargable.  It became 
WHERE DATEADD(WW,+53,TransactionDate) >= GETDATE()
Now the optimizer has no choice: it will use '20160101' and '20160131' as seek predicates on the index, and the query runs orders of magnitude faster.

Please don't take away the feeling that non-sargable predicates are good.  The takeaway is that it is worthwhile looking into the plans of long-running queries, understanding why the optimizer chose that path, and helping it choose a faster path, if one exists.