Tuesday, July 19, 2016

Guiding Principles of Data Warehousing

I know Kimball and Inmon have written vast texts on data warehousing - so much, in fact, that you can find literature to support or criticize almost any design. On this page I want to give my 6 Guiding Principles for designing a bespoke data warehouse. A corollary to my principles is that no particular design feature is inherently bad; it can only be deemed appropriate or inappropriate in a particular implementation.

1. No Loss of Data
All information taken from source systems should be held without loss. This might sound simple, but there can be subtle losses of data through omitting columns, trimming columns, correcting data, denormalising, transforming, cleansing, etc. The test is: any question that the source system can answer (within the tables extracted) should also be answerable from the DW, and with the exact same answer.
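
A simple way to apply this test is to run the same aggregate query against both systems and compare the results. The sketch below is illustrative only; the table and column names, and the assumption that the source and DW copies are reachable from the one connection, are mine.

-- Reconciliation sketch: the DW should return exactly the same answer as the source.
-- SourceDB.dbo.Orders and DW.dbo.Orders are hypothetical names.
SELECT 'Source' AS SystemName, COUNT(*) AS RowCnt, SUM(OrderAmount) AS TotalAmount
FROM   SourceDB.dbo.Orders
UNION ALL
SELECT 'DW', COUNT(*), SUM(OrderAmount)
FROM   DW.dbo.Orders;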

2. Keep it Simple
Minimise the physical layers of transformation that make it difficult to trace data back to its source. A common question will be “why does this record have that value in the DW?” The simpler, and less physical, the transformation, the easier it is to reconcile DW data with source systems. By making reconciliation easier, we can build trust that the DW is valid and can be relied on to give a “correct” answer. If there is a bug in the ETL, it will be easier to uncover in a simple DW. Once a view of the data is materialised in the DW, it becomes more difficult to ascertain where the data came from and which insert and update statements have acted on it.

3. Ease of Enhancement
This is closely related to the second principle, Keep it Simple. The simpler the architecture and the more minimal the physical transformation, the easier it will be for future developers to enhance the DW to add or extend data sources or perform new analysis. DWs are invariably never finished, as changes in business activity, regulatory requirements, enterprise systems, competitive action, supplier action, etc. will drive new DW information and analytical requirements. The best indicator of ease of enhancement is how much development effort the DW took to create, since enhancements tend to be a proportion of that initial effort. A DW that took 10 developer-years to create will take roughly 10 times the effort to enhance as a DW that took 1 developer-year, even if both DWs contain the same information.

4. Historical Integrity
If the DW users require it, historical integrity should be maintained so the DW can report on what the values "were" at the time of the transactions (facts). While it is important to maintain this information, it can be expensive, so it should only be maintained on tables that require it and only on columns of these tables that are worthy of it.

5. High Performance and Room for Scale
The DW should be able to perform well with the anticipated volumes. There is no excuse for slow performance on queries and table sizes that were known in advance. Of course, as the DW grows in unanticipated ways and new queries/analyses are written, further optimisation will be required. The initial design should take into consideration the size of the tables and how they will be joined with related tables. This will lead to large tables that are frequently joined together sharing similar clustered index keys, along with appropriate normalisation, partitioning, etc. These considerations are often very easy to implement at the beginning, but become much more time consuming to change once the DW has been implemented and built upon.
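
As a small illustration of that last point, here is a sketch of two large, frequently joined fact tables given matching clustered index keys. The table and column names are hypothetical.

-- Sketch: aligning clustered index keys on two large tables that are frequently joined,
-- so the join can be satisfied by ordered scans rather than large sorts or hash joins.
CREATE CLUSTERED INDEX CIX_FactOrderHeader ON dbo.FactOrderHeader (CustomerId, OrderDate);
CREATE CLUSTERED INDEX CIX_FactOrderLine   ON dbo.FactOrderLine   (CustomerId, OrderDate, OrderLineId);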

6. Appropriate Design
Just like building architecture, there is no one design that suits all. The design should be appropriate for the purpose. We won’t follow others’ design guidelines blindly.

Saturday, July 16, 2016

Don't Do Time, Do The Distance


Every now and again I take this blog off-road with a reminder that there is more to life than work. This is one of those times.

My friends all know that I enjoy exercising - every day. The reason I enjoy it is that I do the distance, not the time. So my big tip for anyone wanting to enjoy fitness and mental health is to do the distance, not the time.

There is a subtle distinction between time and distance. Doing the time is exercising for a preset time, whereas doing the distance is completing a preset distance (or task). You might end up getting the same exercise, but the difference in enjoyment and satisfaction is huge.

By doing the distance, you are thinking about achieving something, whereas doing the time, you just wait for the time to pass and it drags on. Just as the least enjoyable jobs are the ones where you follow the clock, exercise by time is a drag. The most enjoyable jobs and exercises are ones where you achieve something, semi-regardless of the time.

Also, while you are working to achieve the distance, you will tend to run/swim/pedal/skip etc. more efficiently. This comes naturally as you aim to get to the finish with the least effort. You will try out different techniques and learn more efficient styles, all the while getting there in a more graceful way each day. This is a positive feedback loop where you get the benefit of getting faster and stronger by finishing with less effort.

Arthur Lydiard understood this all those years ago when he had his Olympic running hopefuls repeatedly run over a preset course in the Auckland hills. In my seaside village there are hundreds of swimmers who take to a 1.2km ocean course every day. They love it, and a big part of it is that they are achieving their goal every day. As each day passes, they find it easier and more graceful than the day before. Even if your preference is going to the gym, try setting the activities and reps in advance, rather than staying there for x minutes.

So, get out there and do the distance by walking home, to the next station, or to your favourite café. Each day you will get healthier and stronger with a sense of grace and achievement.

Monday, July 4, 2016

Relative Dates, Today, Yesterday, This Week, Last Week etc

A common requirement in Business Intelligence is for reports on relative dates, such as

  • Today
  • Yesterday
  • Same Day Last Year
  • This Week
  • Last Week
  • This Period
  • Last Period
  • This Year
  • Last Year
These are dates relative to today on the fiscal calendar. It is possible to create these members on the Time Intelligence dimension. However, since they are so widely used, and often in combination with other Time Intelligence members, there is better performance and better functionality if these members are in the time dimension. 

Putting them in the time dimension is relatively easy: all you need is an existing time dimension that has Date as a level of granularity and the ability to use a view (or the SSAS dsv) to create additional date attributes.

In your Date view, simply 
  1. Add in a column for RelativeDay, another column for RelativeWeek etc. This column will have the "relative day" value. For example, 
    CASE
      WHEN CONVERT(DATE, GETDATE()) = MyDateTable.TheDate THEN 'Today'
      WHEN CONVERT(DATE, GETDATE()-1) = MyDateTable.TheDate THEN 'Yesterday'
      ELSE 'Other Days'
    END AS RelativeDay
  2. Add in additional WHEN clauses for Day Before Yesterday, Same Day Last Year, Yesterday Last Year, etc. if you want them. 
  3. Add an attribute for each relative date into the Date dimension. One for RelativeDay, one for RelativeWeek, one for RelativePeriod etc.
  4. Then, all you need to do is ensure that the Date dimension is updated early every morning. You are probably updating your dimensions many times per day. By default dimension attributes are flexible. These attributes need to be flexible as they are changing, literally every day.
From then on, cube users and reports can use these relative dates. For example, a classic use would be a report that filters on This Week and Last Week and shows 8-14 columns, one for each day in the last 1-2 weeks. Whenever a user requests this report, it will default to showing the last 8-14 days, without the user having to select the actual date.
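
Pulling steps 1 and 2 together, here is a minimal sketch of such a view. MyDateTable and TheDate are the names used above; FiscalWeekId and the view name are assumptions of mine.

-- Sketch only: assumes dbo.MyDateTable has one row per date with a FiscalWeekId column.
CREATE VIEW dbo.vwDimDate
AS
SELECT d.TheDate,
       CASE
         WHEN d.TheDate = CONVERT(DATE, GETDATE())       THEN 'Today'
         WHEN d.TheDate = CONVERT(DATE, GETDATE() - 1)   THEN 'Yesterday'
         WHEN d.TheDate = CONVERT(DATE, GETDATE() - 364) THEN 'Same Day Last Year'
         ELSE 'Other Days'
       END AS RelativeDay,
       CASE
         WHEN d.FiscalWeekId = (SELECT FiscalWeekId FROM dbo.MyDateTable
                                WHERE TheDate = CONVERT(DATE, GETDATE()))     THEN 'This Week'
         WHEN d.FiscalWeekId = (SELECT FiscalWeekId FROM dbo.MyDateTable
                                WHERE TheDate = CONVERT(DATE, GETDATE() - 7)) THEN 'Last Week'
         ELSE 'Other Weeks'
       END AS RelativeWeek
FROM dbo.MyDateTable AS d;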

Adding these relative dates will not materially increase the size of your cube; the date dimension will only be marginally bigger. These relative dates are easier to use than Time Intelligence and perform much better, especially when you want to combine relative dates with Time Intelligence.

Saturday, April 30, 2016

I Don't Know

These are 3 words I like to hear.

It means the person has the confidence to say there is something they don't know. Surely, everybody has things they don't know? However, in my industry (computer software), it appears that there are many who find "I don't know" (IDK) as tough to verbalise as "I love you". Why do technicians have trouble saying IDK? Is it insecurity or a fear of appearing ignorant?
 

Making up answers is unhelpful. It can lead to poor decisions. Hearing IDK gives you the chance to research the correct answer before taking action. A simple analogy would be asking a colleague if it's raining outside. If your colleague can't say IDK and gives you their best guess, they might be right. But if they're wrong, you will get wet or take rain gear unnecessarily. The more useful response would be IDK, or "IDK - but it was sunny when I was out an hour ago".


I am happy to seek information from people who can say IDK. People who make up answers are the ones we avoid asking in future. I found early on in my career that saying IDK can also reduce tension. You are saying that it's OK not to know. I typically follow up with "but I want to know and will find out for you".

It takes a self-assured person to say 'I don't know', so let's all be self-assured and more helpful.


Wednesday, March 16, 2016

Memory is Oxygen for Large Database Systems

I am frequently invited in to assist with database (OLTP, DW and OLAP) performance, and often the issue is IO throughput related. So, the solution tends to involve tuning queries (and indexes) to reduce IO or improving the IO capacity of the SAN (or both). Of course, in all of these IO related performance issues, RAM can be a great elixir. Databases love memory. They are designed to use RAM by holding popular pages, compiled plans and work tables in memory. So, if you have a large database with a performance issue; very quick, and relatively cheap relief might be gained simply by adding RAM.

SQL Server is particularly good at exploiting available RAM - and (just as important) reducing memory demands if the operating system is paging. With the default SQL install options, all you need to do is give RAM to the OS and SQL will use it to hold more data and plans in memory. Sometimes the benefit compounds since, by reducing IO demands, the SAN is no longer a bottleneck and the remaining IO is executed much faster.

Multidimensional OLAP databases also like to have lots of RAM - ideally even more than the combined OLAP database size! See my earlier blog. Unfortunately, SQL Server OLAP is not as effective at exploiting memory as the relational engine, but it will hold data (and aggregations) in memory, improving query and processing performance. SQL Tabular is an exception, as it requires the entire cube to reside in memory; it will simply not work without sufficient memory.

RAM, by and large, has no licensing cost with SQL Server, which can be a big consideration. Adding CPUs will generally incur significant OS and SQL licensing costs. I say "by and large" since, with SQL 2012 and SQL 2014, there are, unfortunately, limits on SQL Server Standard Edition memory.
  • SQL 2008 Standard Edition has no memory limit
  • SQL 2012 Standard Edition has 64GB memory limit
  • SQL 2014 Standard Edition has 128GB memory limit
  • (SQL Server Enterprise Edition versions support their OS limits)
So, if you have a performance issue with a large database and can't wait for a database tuning expert, then before adding CPUs, Fusion-io cards, etc., try giving your system more RAM.
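
For a quick check of what the server has and what SQL Server is allowed to use, a sketch along the following lines can help. The 64GB cap is only an example value.

-- How much physical memory does the box have, and how much is free?
SELECT total_physical_memory_kb / 1024     AS TotalPhysicalMB,
       available_physical_memory_kb / 1024 AS AvailablePhysicalMB
FROM   sys.dm_os_sys_memory;

-- Check (and, if appropriate, raise) the memory SQL Server is allowed to use.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';          -- show the current setting
EXEC sp_configure 'max server memory (MB)', 65536;   -- example: allow SQL up to 64GB
RECONFIGURE;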

Monday, January 25, 2016

Basket Analysis Made Easy

This article is aimed at technologists who hear about Basket Analysis and need a simple explanation with examples. Basket Analysis used to be the preserve of high cost specialist providers. Now we have OLAP technology and hardware that makes Basket Analysis available at commodity prices. Everyone involved in retail analytics should understand Basket Analysis. Essentially, Basket Analysis supports analysis of sales by the combinations of items purchased.

To elaborate, I will provide two classic examples of Basket Analysis.

1. Menu Trim
The business would like to remove a couple of low value items from the menu (product list). A smaller menu generally supports lower operational cost and inventory investment. Candidate items might be the lowest by sales value or gross profit contribution. We observe that the lowest value items are Frozen Yoghurt and Fruit Salad, which we can find by a simple slice of items and their sales/gross profit. However, this does not take into account whether customers come in to buy the product on its own, or whether they are buying lots of other products too. We can see from the Basket Analysis query below that Frozen Yoghurt customers buy lots of other products with their Frozen Yoghurt. Basket Analysis can sum all the products the customer buys with Frozen Yoghurt, which we call Affected Net Sales. This suggests that if we remove Fruit Salad from the menu we only risk the $15,414 of sales, but if we remove the Frozen Yoghurt, we risk $1,899,151 of sales. We say that it is "at risk" because the customers might be coming in especially for the Frozen Yoghurt, and purchasing other products while there. Without Frozen Yoghurt these customers might go elsewhere for their entire basket.
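
For the relationally minded, an Affected Net Sales figure can be sketched with a query like the one below. The table and column names (dbo.SalesLine, TransactionId, ItemName, NetSales) are hypothetical.

-- Affected Net Sales sketch: for each item, the total value of every transaction
-- that contains that item at least once.
WITH TxnTotals AS (
    SELECT TransactionId, SUM(NetSales) AS TransactionTotal
    FROM   dbo.SalesLine
    GROUP  BY TransactionId
),
ItemPerTxn AS (   -- one row per item per transaction, so each transaction counts once per item
    SELECT ItemName, TransactionId, SUM(NetSales) AS ItemNetSales
    FROM   dbo.SalesLine
    GROUP  BY ItemName, TransactionId
)
SELECT i.ItemName,
       SUM(i.ItemNetSales)     AS ItemNetSales,
       SUM(t.TransactionTotal) AS AffectedNetSales
FROM   ItemPerTxn AS i
JOIN   TxnTotals  AS t ON t.TransactionId = i.TransactionId
GROUP  BY i.ItemName
ORDER  BY ItemNetSales;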

2. Attracting Large Value Transactions
Your business might have a strategy to increase sales by promoting menu items that tend to be involved in large value transactions. Basket Analysis can help you do this by identifying menu items (product items) involved in large transactions. In the example above, Avocado Toast has a high transaction average of $105, while the average transaction for all items is $45. So, the business strategy might be to discount and/or promote Avocado Toast to attract these high value customers. The marketing geniuses will work out a way to attract these customers, which might be by discounting Avocado Toast, offering a discounted side item, promoting quality/health features, inter-company coupons, etc. Once the campaign starts, Basket Analysis can very succinctly report on the success of the campaign. One of the key indicators will be the transaction average of Avocado Toast transactions during the campaign. The sales of Avocado Toast might increase two-fold, but if the Transaction Average is halved, you are probably only attracting customers who purchase Avocado Toast alone. Ie, you are not attracting customers with large transaction value. In the chart below, we can see that the Avocado Toast promotion has been successful in increasing Avocado Toast transactions; however, these additional transactions have been small value transactions. The marketing geniuses might need to try a new strategy!

I hope this has shown how Basket Analysis can help businesses create better strategies and measure how well those strategies are working. These are just two simple examples of an infinite number of possibilities, limited only by your creativity. Basket Analysis has become a commodity feature in retail analysis and is becoming an essential tool for successful retail strategies.

Thursday, August 27, 2015

Supporting Multiple First Day of the Week Calendars in a Cube

A not uncommon requirement is to support weeks with different start dates. For example, the Finance team might want to see weeks as Tuesday through Monday, while store manager A might prefer to see weeks from Monday through Sunday, and store manager B might prefer to see weeks as Saturday thru Friday. Do not create new dimensions for this requirement. That is unnecessary and would lead to degraded query performance.
 
It is quite simple to support multiple week starting dates, without complicating your cube or degrading performance. You should already have a Date dimension, with one member per calendar day. That's all you need. Now, just add attributes to each day denoting the weekno and dayofweek for each of the, up to 7, week start days. Just to be clear, the number of records in your Date table/view will not change, just new columns.
In your existing date dimension table (or view, in my projects) add in up to 7 new columns - WeekStartingMonday, WeekStartingTuesday, WeekStartingWednesday, etc. - which will contain the WeekId for the week defined by that start day. For the first (Monday thru Sunday) week in the year, those days will all have a WeekStartingMonday key of 1. For the first (Tuesday thru Monday) week in the year, those days will all have a WeekStartingTuesday key of 1. To put it another way, each attribute will have a number between 1 and 53 denoting the week number. These 7 attributes will be subtly different, as WeekStartingTuesday will have Tuesday as the first day of the new week, etc. You could give these weeks captions as well, but that's not necessary at this stage. Note, if you want to have full weeks at the beginning and end of each year, you will need to add another 7 attributes, WeekStartingMondayYear, WeekStartingTuesdayYear, etc. These attributes will hold the year that the day/week is in (ie a number such as 2015, 2016, etc.). Then you can ensure that each year will have exactly 52 or 53 full weeks.
In the table/view for your date dimension, add in another 7 new columns for WeekStartingMondayDay, WeekStartingTuesdayDay, etc, which will have the DayOfWeekId for the day as it behaves in that week start. For example, WeekStartingMondayDay will always be 1 for Monday, 2 for Tuesday; WeekStartingTuesdayDay will always be 1 for Tuesday, 2 for Wednesday etc.
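
Here is a minimal sketch of a couple of these view columns. dbo.DimDate and CalendarDate are hypothetical names, and the sketch keys each week by its starting date rather than the 1-53 week number described above, which is a simpler variant.

-- 1900-01-01 was a Monday, so day-of-week numbers can be derived without relying on DATEFIRST.
-- (The modulo trick assumes dates on or after the anchor date, which holds for typical date tables.)
SELECT d.CalendarDate,
       DATEDIFF(DAY, '19000101', d.CalendarDate) % 7 + 1 AS WeekStartingMondayDay,   -- Monday=1 ... Sunday=7
       DATEDIFF(DAY, '19000102', d.CalendarDate) % 7 + 1 AS WeekStartingTuesdayDay,  -- Tuesday=1 ... Monday=7
       DATEADD(DAY, -(DATEDIFF(DAY, '19000101', d.CalendarDate) % 7), d.CalendarDate) AS WeekStartingMondayKey,
       DATEADD(DAY, -(DATEDIFF(DAY, '19000102', d.CalendarDate) % 7), d.CalendarDate) AS WeekStartingTuesdayKey
FROM dbo.DimDate AS d;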
In your Date dimension, add the new 14 (or 21 if you include year) attributes. For each of the day attributes, keep the WeekStartingMondayDay number 1-7 as the key, but provide the name as the day-of-week name (Monday, Tuesday, etc.). Eg, the key might be 3 and the name might be 'Friday' for a WeekStartingWednesdayDay member. In SSAS, set the attribute's OrderBy property to Key. That way, when the attribute is browsed, SSAS will sequence days by the key, not the name. So WeekStartingWednesdayDay will have Wednesday (key=1) as the first day, Thursday (key=2) as the second day, etc. Your week attributes (WeekStartingMonday, WeekStartingTuesday, etc.) will be OK with the number 1-53 as both key and name. Note, ensure that OrderBy is set to Key for the weeks also, otherwise SSAS might sequence your weeks alphabetically by caption.
Create 7 hierarchies, one for each Week Start day, with Week as the first level and day as the lower level. You may also want to add in Year at the top level of the hierarchy.
Now, your users can choose between one of 7 hierarchies for weeks, depending on which day they would like the week to start.
Users can also cross join one of these 7 hierarchies with Dates; they will then see a heading with the date and the day of week underneath, broken by weeks.
 
This solution will perform well in your cube. Since these attributes are simply attributes of existing members, the measure groups will not change in size with their addition. Also, these attributes will perform well as they are physical attributes, not calculations.

Wednesday, June 24, 2015

SQL Server Advanced Scan, aka Merry-go-round or Scan Piggybacking

I think SQL Server 2005 was the first SQL Server version with Advanced Scan. I have been excited by this feature, comforted by the thought my databases are performing better, but also a little unsure of its efficacy. According to the documentation, when one query is running a table scan, subsequent queries that require a table scan will "piggyback" on the first scan, avoiding additional scan IO. Apparently, any number of queries can all piggyback on the same scan. Once the first one completes, the others carry on from the beginning, as they may have joined in after the first scan started. So there should never be more than one scan of any one table at a point in time.

What a great feature! However, there are no Profiler traces for Advanced Scan, so how do you know it is actually working? I've just run a performance test of Advanced Scan with 6 concurrent queries that all scan the same table (with slightly different predicates). I ran another test with one query running solo.

I was impressed with the results.
The table I used is about 10 times the size of the available memory, so SQL had no chance of caching the table. The solo scan ran as fast as the read ahead engines could provide the data. The multiple read engines (from the solo scan) kept the SAN storage busy with an IO queue count constantly greater than 50 and reached a steady state of about 100MB/second. I ran this a couple of times to warm up the system. Note, this single query was running with MAXDOP=1. It would not run faster with MAXDOP=0 as the SAN storage was already working at maximum throughput with the workload of the multiple readahead engines.

I then ran 6 queries concurrently that all required a scan of the same table, but with slightly different predicates. These 6 scans all completed in a similar time as the first solo scan! That told me that Advanced Scan was working, but where could I see it? I could not find any evidence in Profiler (apart from the duration) that Advanced Scan was operating. In the queries I had SET STATISTICS IO ON. The statistics had the evidence that Advanced Scan was functioning. All 7 queries had similar logical reads, physical reads, CPU time, and elapsed time. However, the 6 concurrent queries had much smaller read-ahead reads. The total for the 6 concurrent queries was equivalent to the one solo scan.
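
For anyone wanting to repeat the experiment, the test queries looked something like the sketch below. dbo.BigTable, Amount and Region are hypothetical; each concurrent session used a slightly different predicate.

SET STATISTICS IO ON;    -- the read-ahead read counts are the tell-tale for Advanced Scan
SET STATISTICS TIME ON;

SELECT COUNT_BIG(*), SUM(Amount)
FROM   dbo.BigTable
WHERE  Region = 'North'  -- each concurrent query filters on a different value
OPTION (MAXDOP 1);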

It's unfortunate that Profiler does not have an Advanced Scan event (or flag on scan). However, this test does demonstrate that the feature works and that it can be quantified.

Note: Advanced Scan is only available in Enterprise Edition. I initially tried to monitor this feature in a developer version of SQL Server, but it gave mixed results. Also note, the test above had Max Degree of Parallelism set to 1. I received similar results with MAXDOP set to 0.

Tuesday, June 16, 2015

Natural Keys are Healthy

It seems that I often attract suggestions from developers who have recently been reading Ralph Kimball. This week, it was a suggestion to add surrogate keys to all the dimension tables because this "would make the datamart more extensible and scalable"! I had to argue my preference to only use synthetic (surrogate) keys when there is a need. If the natural business key works and there is no need for SCD (slowly changing dimensions, or Type 2, as Kimball would say), I would keep the design simple and use the existing natural key. That key is already maintained by the source system and business users are typically familiar with it. In this case the main dimension table contains book titles and has a natural key of ISBN. I think that since the ISBN is an International Standard Book Number, and its attributes don't need SCD support, it would make a perfectly good primary key of the titles dimension table. Also, the fact tables would have ISBN as a foreign key - just like the source system. This requires much less ETL logic and makes reconciliation much easier. There is no need to translate surrogate keys when determining which facts are related to ISBN=x. Why would you want to replace ISBN with a synthetic number, which is meaningless to everyone? I struggle with applications that are more complex than they need to be. I think Einstein said something along the lines of "Everything Should Be Made as Simple as Possible, But Not Simpler".
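
A minimal sketch of the design being argued for is shown below; the table and column definitions are illustrative only.

-- The natural ISBN key is the primary key of the dimension and the foreign key on the fact,
-- mirroring the source system.
CREATE TABLE dbo.DimTitle
(
    ISBN      char(13)      NOT NULL PRIMARY KEY,
    Title     nvarchar(200) NOT NULL,
    Publisher nvarchar(100) NULL
);

CREATE TABLE dbo.FactSales
(
    SaleDate    date     NOT NULL,
    ISBN        char(13) NOT NULL REFERENCES dbo.DimTitle (ISBN),
    Quantity    int      NOT NULL,
    SalesAmount money    NOT NULL
);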

Some might argue that the fact table will be thinner with a numeric foreign key rather than the 13 byte ISBN. That is true, however I think it's risky to remove the natural key (ISBN) from the fact table as it makes reconciliation with ISBN difficult in the event of a bug in the synthetic key allocation. Also, the fact and dimension tables might like indexes on these keys, and with natural and synthetic keys, there might be a need for twice the number of indexes. Overall, I would argue that natural keys offer higher performance.

I have seen datamarts mystified by inappropriate use of surrogate keys. To illustrate, imagine the titles dimension constructed with a synthetic surrogate key and the ISBN no longer unique. This might be argued by a few to be superior as it would support historical integrity. (Hmm, like historical integrity is necessarily superior.) Think about a slowly changing ISBN! Someone enters a new title as "Kate the Cat", when it should have been "Katie the Cat". After a few weeks, it gets noticed and is corrected. However the SCD datamart is too clever and says that it was "Kate the Cat" for those first few weeks, so it must remain so, for historical integrity! From now on, users of this dimension must combine "Kate the Cat" with "Katie the Cat" to get sales information on this book. This can be frustrating for users, who know only one book called "Katie the Cat".

I have also seen datamarts where the dimension tables have blown out by more than 20 times (1 Million customers turns into a 20 Million row customer dimension) because all attributes are maintained with SCD logic. So each customer record had been updated an average of 20 times. Sometimes these changes are irrelevant to the business regarding SCD, such as a field containing the number of Customer Service calls. In this case, every time the customer calls up, there will be a new record in the customer dimension, with just the call count incremented by 1. For what purpose?

I have also seen date dimensions with surrogate keys! This also puzzles me. Does the developer believe that attributes of a date will change and that reports should show those dates with historical integrity? To my simple mind, that just seems to be a quick way to make a datamart/DW more complex for no benefit. Unless, there is benefit in increasing future work loads as maintenance and reconciliation become bigger chores.

I would add, that if you really do identify a need for SCD I would encourage you to only maintain SCD on the identified attributes. Do not apply SCD logic to all attributes blindly.

Please don't see this blog as anti-Kimball. I truly believe that he has brought a lot of rigour and maturity to data warehousing. However, following his dimensional modelling blindly is not pragmatic and does not help organisations trying to build business intelligence.

Sunday, June 7, 2015

Fast CASE statements in MDX

Sometimes we need to create a CASE statement in an MDX calculation to split a measure value. There was a good example in this msdn forum. For a simple example
CASE 
     WHEN Measures.[MyCalc]=0 THEN 0
     WHEN Measures.[MyCalc]<=5 THEN 5
     WHEN Measures.[MyCalc]<=10 THEN 10
     WHEN Measures.[MyCalc]<=15 THEN 15 
     WHEN Measures.[MyCalc]<=20 THEN 20 
     WHEN Measures.[MyCalc]<=25 THEN 25     
...
     ELSE 100
END

What I have found is that Measures.[MyCalc] is recalculated for every WHEN clause until it drops out. So the CASE statement could be rewritten as below with much greater performance. If your calculations are parsing through 20 WHEN clauses, this new expression could be 20 times faster! This is most noticeable when MyCalc is itself expensive, particularly when it is FE (formula engine) intensive.

CASE Round(Measures.[MyCalc]/5)*5
     WHEN 0 THEN 0
     WHEN 5 THEN 5
     WHEN 10 THEN 10
     WHEN 15 THEN 15 
     WHEN 20 THEN 20 
     WHEN 25 THEN 25     
...
     ELSE 100
END

Note: you cannot put < or > operators in the WHEN clauses if you use the second format above. However, with a little arithmetic, you might be able to bracket the calculation as I have done above using Round().
Another note: it might not be possible to express the CASE statement as I suggest above because there are several conditions. However, I encourage you to think laterally; you might be able to nest a few CASE statements on the various conditions, so there are fewer calculations per CASE statement, making them perform much better.

Saturday, June 6, 2015

Confusing = with IS in MDX calculations

This topic has been written about many times, perhaps even by me. However, it is a common pitfall, so I am writing about it again. See this msdn post for a recent example.

Essentially, the pitfall is when an author wants to check if the current member of a dimension attribute is at a particular member. The incorrect MDX might look like this
IIF(DimCustomer.CustomerClass.Currentmember = DimCustomer.CustomerClass.CustomerClass.&[Gold], Measures.Price*0.9,Measures.Price)

However, the "=" operator is actually testing whether the tuples on either side are equal in their measure value. Since there is no measure specified, the measure will be the one on the current slice (tuple), or the default measure if there isn't one in the tuple. Whether the two tuples happen to be equal is rather arbitrary, so the results will be somewhat unpredictable.

What should be used is the "IS" operator, which will test if the members on either side are the same. That is; is the current CustomerClass Gold? So the IIF() should be rewritten as
IIF(DimCustomer.CustomerClass.Currentmember IS DimCustomer.CustomerClass.CustomerClass.&[Gold], Measures.Price*0.9,Measures.Price)

Note, if you are having problems following this logic, the IIF() with the IS operator is equivalent to the following statement.
IIF(DimCustomer.CustomerClass.Currentmember.properties("key") = "Gold", Measures.Price*0.9,Measures.Price)

Tuesday, May 19, 2015

Putting private data to public good

I was very pleased to read about IBM's collaboration with US cancer institutions, using the history of people's cancer treatments and their progress to help determine the most effective treatment for a newly diagnosed patient. This is a great example of putting private data to public good. I have been looking for similar opportunities in the health sector for many years. There is so much scope to use historical diagnoses, demographic data, treatments and outcomes to forecast the risk and efficacy of various treatment options. However, I have found a pretty solid wall declaring such data couldn't be used for "privacy" reasons. This is a real shame, for if I were diagnosed with an illness where the health specialist was deciding between various drug therapies and surgical options, I would like that specialist to make use of all the information available. Of course, very experienced health specialists will have a vast range of experience and have read reams of published studies. But I am sure that a data mining model over national (or even international) databases would help estimate the risks and efficacies of the various treatment options.

I find the general concern over "data privacy" intriguing. Whether it's your Google search terms, driving records, or telephone calls, everyone must accept that their personal data is held somewhere. Why are we so paranoid about our privacy? We have lived in communities for thousands of years, where much about us is known by the community. We could all withdraw with curtains drawn and minimal social interaction, but we would then miss out on so many community benefits.

I hope that in the future, we can all open our blinds and see more "private" data put to public good.


Monday, January 19, 2015

Basket Analysis IV - Menu Trim

This is the fourth post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

Retail stores are always trying to balance the range of stocked items with inventory efficiency. A common task is to perform a "menu trim" where low profit contribution items are removed from the menu. A simplistic approach would be to remove the items with the lowest sales value or gross profit. However, this can result in trimming items that, while not profitable in themselves, are attracting customers and have high "associated" sales and profit contribution.
Consider the menu items above that are ordered by item sales. The Berry Smoothie is the item with the lowest sales value and gross profit. A simplistic view would be to say let's trim Berry Smoothie off the menu as it only represents $795.23 of sales. However, from our basket analysis query, we can see that the Berry Smoothie has a high Associated Net Sales and Associated Gross Profit. If we took Berry Smoothie off the menu, we might be risking $19,400 of associated sales. This is the total value of all transactions that have included Berry Smoothie. It might be that customers are coming into the store primarily for the Berry Smoothie and getting other items with the smoothie. If we remove Berry Smoothie from the menu these customers might not patronize our store.
Basket Analysis does not answer these questions directly, but it does give us the information to make a better informed decision. In the case above, we might consider trimming the Vegeburger from the menu, as it has the lowest Associated Net Sales and lowest Associated Gross Profit. It appears that customers who come in to buy vegeburgers purchase that item alone. So if that item were trimmed from the menu, it would appear to put less sales value at risk.

Friday, December 19, 2014

Basket Analysis III: Promoting Sales Items Associated with Heavy Transactions

This is the third post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

One common retail strategy is to promote a relatively low value item that is associated with "heavy" baskets. That is, to identify sales items that tend to be in baskets that have many other profitable items. Sometimes these items are a little quirky, and customers go to that particular retailer for the quirky item and, while they are there, purchase many other items. The rationale is that if we can attract more customers to purchase the quirky item, which has little profit contribution, they will purchase many other, profitable, items. So the quirky item can be promoted and discounted with low, or even negative, gross profit. Basket Analysis technology can help identify these items, and it can also accurately measure the performance of the campaign. 

Here is a graph showing some of the key performance indicators around a promotion of Green Smoothies. This particular promotion increased the sales of Green Smoothies by a factor of 5, but these increased sales did not pull through the hoped-for additional sales. See the green line near the bottom of the chart - it shows the sales value of Green Smoothies, and you can see a clear increase in their sales during the promotion period. Now look at the Average Txn value including Green Smoothies (blue bar) during the promotion period. It drops from about $20/transaction to about $16/transaction. Also, notice the purple line showing the Green Smoothie affected sales. See how it has increased, but only by the value of the increased Green Smoothie sales. This line shows the total value of all transactions that have at least one Green Smoothie. This suggests that promotion shoppers are coming in to purchase the Green Smoothies, but they are only purchasing Green Smoothies. It is valuable to get an early appreciation of the effectiveness of a campaign like this, so that it can be halted or modified early.

Note, traditional sales reporting tools could easily report on sales values of Green Smoothies, but it is only with Basket Analysis tools that we can get metrics like Green Smoothie affected sales and Average transaction value including Green Smoothies.

For the next Basket Analysis post see Basket Analysis IV.

Monday, December 15, 2014

Basket Analysis II: Would you like a drink with that?

This is the second post in a series on Basket Analysis. To read the first post go to Basket Analysis I.

A common question a customer receives when ordering food is "Would you like a drink with that?" This is an example of on-selling, where the customer has come in with an order, and the salesperson, cashier, etc. looks for an opportunity to increase the size (and profit contribution) of the transaction, and potential satisfaction of the customer, by inviting the customer to add another item to the transaction.  The activity can be much more subtle, such as "We have some hot-from-the-oven blueberry muffins today - would you like one with your coffee?" The intention is the same, and the activity usually involves some staff training and encouragement. Naturally, some training sessions are better than others and some supervisors, motivations, etc are better than others at encouraging staff to invite these on-sell opportunities. Basket Analysis is a great tool to help monitor and analyse the effectiveness of staff, training, and promotional activities. For example, a program could be developed to encourage staff to offer oven-hot blueberry muffins. Without Basket Analysis tools, a sales report might display the ratio of blueberry muffins per 1000 transactions to compare various staff, training programs etc. However, this report will be inappropriately affected by a customer purchasing a dozen blueberry muffins in one transaction. Basket Analysis will provide metrics such as % of hot drink transactions that contain blueberry muffins.
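
A sketch of the attach-rate metric is below. The table and column names (dbo.SalesLine, ProductGroup, ProductName, CashierId, TransactionId) are hypothetical.

-- Attach rate sketch: % of hot drink transactions that also contain a blueberry muffin, by cashier.
WITH HotDrinkTxns AS (
    SELECT DISTINCT CashierId, TransactionId
    FROM   dbo.SalesLine
    WHERE  ProductGroup = 'Hot Drinks'
),
MuffinTxns AS (
    SELECT DISTINCT TransactionId
    FROM   dbo.SalesLine
    WHERE  ProductName = 'Blueberry Muffin'
)
SELECT h.CashierId,
       COUNT(*)                                  AS HotDrinkTransactions,
       COUNT(m.TransactionId)                    AS WithMuffin,
       100.0 * COUNT(m.TransactionId) / COUNT(*) AS AttachRatePct
FROM   HotDrinkTxns AS h
LEFT JOIN MuffinTxns AS m ON m.TransactionId = h.TransactionId
GROUP  BY h.CashierId;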

The charts below report on the effectiveness of the training campaign. The top graph shows that all 6 cashiers are handling similar numbers of transactions with hot drinks. The training campaign was conducted during week 29. Notice how most cashiers increased their attach rate of muffins by about 30%. Elizabeth did not change her attach rate (maybe she missed the training) and Barry's attach rate increased, but not as much as the other cashiers'.

For the next Basket Analysis post see Basket Analysis III.

Sunday, December 14, 2014

High Performing Big Table Loads in SSIS

If you are trying to load a very large SQL Server table with fast load, you might be thwarted by not having the rows per batch/commit that you think you have.

Here is a rather vanilla fast table load destination using an SSIS data flow task. Notice how I have Rows per batch=200,000 and Maximum insert commit size=200,000. You might expect that the fast load will insert about 200,000 rows per batch and transaction. You would be wrong, as it is also gated by the buffer size of the data flow task. To determine how many rows are being inserted per batch/transaction, simply use SQL Profiler and trace the INSERT BULK commands. They are simply SQL:BatchCompleted events. To see the number of rows, look at the RowCounts column in the trace. It will probably be a little less than 10,000.
Firstly, why is 10,000 rows per commit slow? If you have no indexes on your large table, the chances are that the loading is very fast. However, if there are non clustered indexes, the loading is slowed down by sorting the batch records into each index sequence and applying them to the indexes. Applying the keys to the index could touch up to 10,000 pages. Then the updates are committed, requiring all updates to be written out to disk. Increasing the commit size can make updating the indexes more efficient. Note, the performance does not get forever faster as the commit size increases. There is a memory cost with large commit sizes, and the cost of sorting larger groups of records increases at a greater than linear rate. To determine the best size for your task, run and monitor several sizes and plot their speeds. Choose the buffer size (batch/commit size) that maximizes load speed. Note, the optimal size for a small table will not be the same once the table becomes very large with indexes larger than db cache memory.

Also note, if you are loading a table from scratch, it can be more efficient to load the table with just a clustered index, then create the indexes in parallel after loading. Of course, partitioned tables will give you more flexibility in tuning your index build and loading.
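
If you take the "load first, index later" route for the non clustered indexes, a sketch of the pattern is below. The table and index names are hypothetical; only disable non clustered indexes, as disabling the clustered index makes the table inaccessible.

-- Before the SSIS load: disable the nonclustered index so the load only maintains the clustered index.
ALTER INDEX IX_BigFactTable_Customer ON dbo.BigFactTable DISABLE;

-- ... run the SSIS fast load ...

-- After the load: rebuild the index (this is where parallelism and tempdb sorting help).
ALTER INDEX IX_BigFactTable_Customer ON dbo.BigFactTable REBUILD
    WITH (MAXDOP = 8, SORT_IN_TEMPDB = ON);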

Anyway, to increase the commit size, you will need to increase the Data Flow buffer size. See below: the DefaultBufferMaxRows is 10,000 and DefaultBufferSize is 10,485,760 bytes. Increase these configuration settings to get larger commits.


Monday, December 8, 2014

Basket Analysis I: What is Basket Analysis

This is planned to be the first blog in a series on Basket Analysis.  In this blog, I will try and answer the question, What is Basket Analysis?

Basket Analysis technology offers an exciting and innovative way of looking at retail data. Simple business intelligence will provide retailers with all sorts of analysis of individual products, including sales by cashiers, customers, profit contribution, waste, discounts, labour, time of day, weather, etc. All this analysis is essential for most retailers in today's market. However, more and more retailers are using the behaviour of shoppers to look for opportunities to improve their operation.

Basket Analysis is about analysing which sales items are bought together (in the one transaction, or in different transactions for the same customer). A Basket Analysis solution will enable you to query and report on which products are bought together, and on the profit contribution of all the "baskets" (transactions) that involve a particular combination of products being in, or out of, the transaction.

Basket Analysis is not a set of metrics or KPIs from which you can run your business. It is a data solution that allows you to ask questions about the product mix of transactions. Perhaps the simplest question is "Tell me about the total transaction value by individual product item". For every sales item, show me the
  • Item Average Price
  • Item Sales Quantity
  • Item Sales Value
  • Item Cost of Goods
  • Item Gross Profit %
  • Item Average Discount
  • Transaction Sales (total value of transactions including this item)
  • Transaction Count (number of transactions including this item)
  • Transaction Average Value
  • Transaction Average Quantity (average items/transactions)
  • Transaction Average Discount
  • Transaction Cost of Goods
  • Transaction Gross Profit
  • Transaction Gross Profit Percent
  • Transaction Value Multiple (total transaction sales/item sales)
Here is an illustrative report answering that very simple question of a Basket Analysis solution. It is simple because the analysis is by individual items. A more sophisticated query would be on particular combinations (existence, or non-existence) of items. That is for subsequent blogs in this series.
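
A sketch of how a few of these measures could be derived relationally is below; dbo.SalesLine and its columns are hypothetical, and a cube would normally serve these numbers up pre-aggregated.

-- Per-item sales alongside the totals of the transactions containing the item.
WITH TxnTotals AS (
    SELECT TransactionId, SUM(SalesValue) AS TransactionSales
    FROM   dbo.SalesLine
    GROUP  BY TransactionId
),
ItemPerTxn AS (
    SELECT ItemName, TransactionId, SUM(SalesValue) AS ItemSales
    FROM   dbo.SalesLine
    GROUP  BY ItemName, TransactionId
)
SELECT i.ItemName,
       SUM(i.ItemSales)                                 AS ItemSalesValue,
       SUM(t.TransactionSales)                          AS TransactionSales,
       COUNT(*)                                         AS TransactionCount,
       SUM(t.TransactionSales) / COUNT(*)               AS TransactionAverageValue,
       1.0 * SUM(t.TransactionSales) / SUM(i.ItemSales) AS TransactionValueMultiple
FROM   ItemPerTxn AS i
JOIN   TxnTotals  AS t ON t.TransactionId = i.TransactionId
GROUP  BY i.ItemName;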


From this simple report there are already many interesting insights available.  For example
  • Items that have a low item sales value but a high transaction value (Transaction Value Multiple) are items that, whilst they contribute very little profit themselves, may be bringing in customers who purchase more profitable items.
  • Items with a very low Transaction Value Multiple, close to 1, are bought alone and probably do not induce customers to buy other products.
  • This simple report for one sales item, such as Cheese Cake, by waiter/cashier can give you insights as to which staff/outlets are good at on-selling.
  • Sales items with the highest Transaction Gross Profit are ones that are likely to be important to your business, even though they may not rank highly among the top individual sales items.
For the next Basket Analysis post see Basket Analysis II.


Thursday, November 20, 2014

Yikes - A read query is blocking everything!

This is something I saw in an OLTP database: an application (.NET) read query was scanning a multi-billion row table while holding a full table S lock. The query thought it was using the default isolation level of read committed. Almost every other query in the system was locked in a chain behind this read query: all update queries on the table (arguably the most important table in this OLTP application) were blocked by the S lock, and they in turn prevented any other read queries from jumping the queue, so those were waiting for locks too.

By default, SQL Server read queries will use a locking isolation level of READ COMMITTED. Most developers just assume this and, understandably, do not bother to set the isolation level. With READ COMMITTED, a read query will only lock the current record while scanning many rows. It has an "intent" lock on the table, which is compatible with all other "intent" locks on the table. However, if you are using connection pooling, which most applications do nowadays, it is possible that your query has picked up an isolation level of REPEATABLE READ from a recycled connection. This is quite serious. Now your query won't release locks as it scans through rows, and once it has read a few thousand rows, it will stop taking row/page locks and lock the entire table. This is called Lock Escalation: because you have taken so many locks on one object, SQL Server says "that's enough, you may as well have one lock on the entire table". Now it gets interesting. Any INSERT, UPDATE, or DELETE query will have to wait on your query, since they are incompatible with your S (share) lock on the table. What happens next is even more interesting (in a nefarious kind of way). All subsequent read queries (except NOLOCK read queries) will wait for a lock too. They are compatible with the full read lock, but they are incompatible with the IX locks of the insert, update, and delete queries, so they must wait in the queue. I call this the "no queue jumping" rule.

How exactly does your query get a REPEATABLE READ isolation level when it didn't ask for it? The answer is connection pooling. An application program can change the isolation level of the connection and leave the changed isolation on the released connection. The next thread to pick up the connection will have the isolation level left behind. Many people have raised this issue before and there has been a Connect issue supported by many developers.
http://connect.microsoft.com/SQLServer/feedbackdetail/view/243527/sp-reset-connection-doesnt-reset-isolation-level
Unfortunately, I feel, Microsoft's response has been to close the "Bug" with an "as By Design" flag!

So what can we do to avoid queries escalating to full table locks when they assume isolation of READ COMMITTED?
  1. The cleanest solution would be for the connection pool to pass on used connections with their default settings. This is unlikely to happen in the near future as indicated above by Microsoft. 
  2. You could disable connection pooling. This is quite heavy handed and you will pay for it with a slower application, as new connections generally take hundreds of milliseconds to create.
  3. In your application that uses connection pooling, ensure that all connections are reset to the default isolation level before release. Note, sp_reset_connection does not do this. Also note, there might be times when a piece of code does not follow this standard. Unfortunately, the non standard code won't be affected.  It will be the innocent code that takes on the connection that initiates the locking issue.  
  4. In your application that uses connection pooling, ensure that all new connections are set to the desired isolation level. Ie, don't rely on a new connection having the default settings. This is probably a good practice, as all code that follows this standard will be OK.
  5. Use the WITH (NOLOCK) table hint on your read queries. This will certainly avoid the lock escalation. However, there can be interesting data integrity issues in reading with no locks, such as reading the same row twice. I think there is a place for NOLOCK, but I wouldn't advise the use of NOLOCK throughout your application. The problem is the use of REPEATABLE READ, not READ COMMITTED.
  6. Use WITH (READCOMMITTED) table hint on your read queries. This will override REPEATABLE READ isolation level on the connection for this table. So the problem above is avoided, but it is a bit of a task changing all the queries to use this table hint. Note, query hints have a bad rap, which is understandable when they override the optimizer's choice of access path. I see no problem with query hints when they are requesting a specific isolation level.
  7. Use READ COMMITTED SNAPSHOT ISOLATION on your database. This is not a solution, as the read query is on a connection that has REPEATABLE READ isolation. As the name suggests, READ COMMITTED SNAPSHOT ISOLATION only works on connections that are using READ COMMITTED.
  8. Use SNAPSHOT isolation (ALLOW_SNAPSHOT_ISOLATION) on your database. This is not a solution either, as it does not override the REPEATABLE READ isolation level of the read query.
  9. Wait for an upgrade to SQL Server 2014. It looks like isolation levels are reset on pooled connections. However, you should check, as this information has simply been posted in an msdn forum, albeit by a SQL team member. 
How can I determine if REPEATABLE READ is being used by my application?
  • A SQL Server Profiler trace on ExistingConnection events will provide the isolation level of all active connections. Just look for set transaction isolation level and see what follows. If any of your current connections have REPEATABLE READ you should be concerned. ExistingConnection will only display currently active connections. To see if the isolation level is being changed, you will need to trace SQL:BatchCompleted, SP:StmtCompleted, and Prepare SQL events, looking for text like '%set%transaction%isolation%level%repeatable%read%'.
  • Another useful Profiler trace is to trace Lock:Escalation. I believe that any table lock escalation in your OLTP environment should be of some concern to you, particularly if you have locking or deadlock issues. However, if you have too many to look at, you could filter in/out by Type=5 (Object), objectId (table) and/or Mode=3 (S-share). A complementary check using a DMV is sketched below.
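
As a quick check without Profiler, sys.dm_exec_sessions exposes the isolation level of every current session; the query below is a sketch (level 3 is REPEATABLE READ, 2 is READ COMMITTED).

SELECT session_id,
       host_name,
       program_name,
       transaction_isolation_level  -- 0=Unspecified, 1=ReadUncommitted, 2=ReadCommitted,
                                    -- 3=RepeatableRead, 4=Serializable, 5=Snapshot
FROM   sys.dm_exec_sessions
WHERE  transaction_isolation_level = 3
AND    is_user_process = 1;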
How lucky are we? There are two ways of viewing the locking incident.

  1. Unfortunate that it happened and get a quick fix asap for this particular query.
  2. Fortunate that this relatively small incident occurred, which enabled us to uncover something quite serious (full table locks in an OLTP environment). The issue has given us an opportunity to discover the dynamics of connection pooling and has helped us find an application-wide fix to prevent its future occurrence. Also, if there is a future locking incident, we have more knowledge to diagnose that too. I see this as analogous to someone having high blood pressure and going to the doctor for a fix. The high blood pressure could be seen as an annoyance simply needing some pills to lower it, or it could be an opportunity to determine the cause of the high blood pressure (be it diet, stress, etc.) and rectify the underlying cause.

Wednesday, November 19, 2014

Relative Dates - Today, Yesterday, LastWeek etc.

A useful little piece of functionality I often add to OLAP cubes is what I call Relative Dates.  In the Date dimension I add attributes CurrentDay, CurrentWeek, CurrentPeriod, CurrentYear etc.  These attributes are simply functions within the Dates dimension view, which gets processed early every morning. For example CurrentDay could be something like
Case
when d.MyDate=convert(date,getdate()) then 'Current Day'
when d.MyDate=convert(date,dateadd(dd,-364,getdate())) then 'Current Day LY'
when d.MyDate=convert(date,dateadd(dd,-1,getdate())) then 'Yesterday'
when d.MyDate=convert(date,dateadd(dd,-365,getdate())) then 'Yesterday LY'
else 'Other days'
end CurrentDay,

Note, Current Day Last Year is -364 days, which will ensure that it will be the same day of week. It's the same as saying -52 weeks.

Note, you only need to process the Date dimension, the large fact tables do not need to be reprocessed, as their Date dimension keys are not changing. It's just the, flexible, attributes of the Date dimension that are changing daily.

Note, it is not practical to create a hierarchy of these attributes as they don't fall within one another. For example, Today and Yesterday might be in the same fiscal week, or they might fall in consecutive weeks.

The logic for CurrentWeek etc is a bit more complicated as your case statement will be checking if the FiscalWeek of getdate() is the same as the FiscalWeek of MyDate. Eg.
case
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,getdate())) then 'Current Week'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-52,getdate()))) then 'Current Week LY'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-1,getdate()))) then 'Last Week'
when d.MyWeek=(select dd.MyWeek from tbDates dd where dd.MyDate=convert(date,dateadd(WW,-53,getdate()))) then 'Last Week LY'
else 'Other weeks'
end CurrentWeek,

You can do the same sort of logic for Periods, Quarters, and Years. Note, this functionality does not replace Time Intelligence. If anything it adds to it as Time Intelligence will go across the Relative Dates attributes, or, more likely, you can use the Relative Dates attributes to filter dates and use a regular date hierarchy on rows/columns with Time Intelligence. The advantage of the CurrentDay, CurrentWeek, CurrentPeriod, and CurrentYear attributes is that reports can be relative in time. Ie, the filter can be a multi-select of this period and last period, showing weeks on columns, which will have 5-9 weeks.  So the report will always display the last 5-9 weeks, without the user having to select the current period etc. These filters will work in Reporting Services, Excel, PerformancePoint, or any other cube browser/reporting tool.

Wednesday, September 3, 2014

SQL Functions and Plan Caching

I feel that I have become typecast as a Business Intelligence specialist. However, I often employ my SQL skills and I still enjoy the challenge of solving pure SQL performance issues. Here's an issue that presented itself as a table-valued function in an OLTP environment that was taking a lot of compilation effort (6.7 seconds) on every execution, while the execution itself took only 0.15 seconds. It was executing about once per second at peak.

Now, I am a great fan of functions, but I know that they are not compiled like stored procedures, so I am wary of using them for complex SQL queries. This query was very complex. I've got the function's code below, but it is based on a much more complex view, which joins 5 tables. There is a picture of the function execution above. See how the execution time is only 125ms, while the parse and compile time is 6638ms! The obvious solution is to convert this function to a stored procedure, which will enable SQL to store the plan.

However, this would involve changes to .NET code, which would involve a lot of testing and change control. I found another solution, which enabled us to fix the function without any change to .NET code, permitting a quick fix to production. The fix was to alter the function so that it is a multistatement table-valued function! This way SQL Server will treat the 'multistatement table-valued function' like a stored procedure and enable its plan to be cached. You can see from the execution results below that there is no parse and compile time, and the execution time is the same 125ms.
How does it do this?  It appears that SQL Server will treat the multistatement table-valued function as a stored procedure.  We can see this from the plan cache query below.

Note, SQL Server is able to cache Adhoc queries, so that your execution of a function with a particular parameter can be cached. Don't let this cloud your analysis, unless your OLTP environment is going to execute the function with exactly the same Adhoc text. You will see these cached plans above with objtype='Adhoc'.

Note, there is documentation that explains this behaviour, although you might have missed it.  See http://msdn.microsoft.com/en-us/library/cc293623.aspx and read the text near 'multistatement table-valued functions'.

What does this tell us?  Firstly, it reminds us that compile time is not always trivial.  Secondly, it reminds us that we should treat Functions as different to Procedures, and only use Functions where they are appropriate.  I find them most useful where they are doing something like string manipulation and optimisation is not appropriate.

Old function (with 6.8 second compile time every execution)
create function MyFunction 
(@Parameter1 int) 
RETURNS TABLE AS
RETURN (
SELECT t2.*,t1.MyDate
from (select max(t4.TheDate) MyDate, 
  t3.BranchNo,
  t3.TranCode, 
  t3.Cash1, 
  t3.Cash2,  
  t3.Cash3, 
  t3.Cash4
      from vBranchCount as t3 
      left outer join  (SELECT *
                        FROM vBranchCount
                        WHERE vBranchCount.Status = 'Applied') AS t4 
ON t3.TranCode = t4.TranCode 
                AND t3.BranchNo = t4.BranchNo 
                AND t3.Cash1 = t3.Cash1 
                AND t3.Cash2 = t4.Cash2 
                AND t3.Cash3 = t4.Cash3 
                AND t3.Cash4 = t4.Cash4 
                AND t4.TheDate < t3.TheDate
      where t3.StockCountPlaceId = @Parameter1
      group by t3.BranchNo,t3.TranCode,t3.Cash1,t3.Cash2,t3.Cash3, t3.Cash4) as t1
inner join vBranchCount as t2 
  on t1.BranchNo = t2.BranchNo
 and t1.TranCode = t2.TranCode
 and t1.MyDate = t2.CashDate
 and t1.Cash1 = t2.Cash1
 and t1.Cash2 = t2.Cash2
 and t1.Cash3 = t2.Cash3
 and t1.Cash4 = t2.Cash4  )

New Function with 6.8 second compile time only on first execution.
create function MyNewFunction 
(@Parameter1 int) 
RETURNS @t TABLE
(  TeamID bigint,
           Rain bit,
           WitchHat varchar(50),
           JoyStormID bigint,
           Happyness int,
           TeamName varchar(250),
           CashIn money,
           CashOut float,
           TheDate datetime,
           JoyClerkID bigint,
           JoyClerkName varchar(150),
           TranCode bigint,
           Bills float,
           CashDate datetime,
           BranchNo bigint,
           Temperature varchar(50),
           StockCountPlaceId bigint,
           Applicability int,
           Readyness int,
           SiteID bigint,
           BillsToChange int,
           Cash1 varchar(50),
           Cash2 varchar(50),
           Cash3 varchar(50),
           Cash4 varchar(50),
           JoyUnit1 int,
           JoyUnit2 int,
           JoyUnit3 int,
           JoyUnit4 int,
           TranSystemId bigint,
           Handle1 varchar(35),
           Handle2 varchar(35),
           Consumer varchar(50),
           JoyConsumerId bigint,
           Rainbow varchar(50),
           StormCode varchar(50),
           ChangeDate datetime)
 AS
BEGIN  
INSERT INTO @t
SELECT t2.*,t1.MyDate
from (select max(t4.TheDate) MyDate, 
  t3.BranchNo,
  t3.TranCode, 
  t3.Cash1,
  t3.Cash2,
  t3.Cash3, 
  t3.Cash4
 from vBranchCount as t3 
 left outer join  (SELECT *
FROM vBranchCount
WHERE vBranchCount.Status = 'Applied') AS t4 
ON t3.TranCode = t4.TranCode 
AND t3.BranchNo = t4.BranchNo 
AND t3.Cash1 = t3.Cash1 
AND t3.Cash2 = t4.Cash2 
AND t3.Cash3 = t4.Cash3 
AND t3.Cash4 = t4.Cash4 
AND t4.TheDate < t3.TheDate
 where t3.StockCountPlaceId = @Parameter1
 group by t3.BranchNo,t3.TranCode,t3.Cash1,t3.Cash2,t3.Cash3, t3.Cash4) as t1
inner join vBranchCount as t2 
 on t1.BranchNo = t2.BranchNo
 and t1.TranCode = t2.TranCode
 and t1.MyDate = t2.CashDate
 and t1.Cash1 = t2.Cash1
 and t1.Cash2 = t2.Cash2
 and t1.Cash3 = t2.Cash3
 and t1.Cash4 = t2.Cash4   

RETURN
END

Query plan cache

SELECT usecounts, cacheobjtype, objtype, text 
FROM sys.dm_exec_cached_plans 
CROSS APPLY sys.dm_exec_sql_text(plan_handle) 
WHERE (text like '%MyFunction%' or text like '%MyNewFunction%')
and objtype <> 'Adhoc'
and cacheobjtype='Compiled Plan'
ORDER BY usecounts DESC;