Wednesday, March 20, 2013

Pitfalls of Big Data

I have just assisted in another Big Data project review. There is a set of common pitfalls that new Big Data developers keep falling into. In this post I hope to shed light on a few pitfalls that are quite easy to avoid, but difficult to remedy.

  • Insisting on flat file input sources, then simply loading these into flat database tables. I understand that some source systems are remote and that flat files are the appropriate vehicle to transport data to your warehouse. However, if the source system is a relational system within your operations centre, it is much easier and more reliable to read the data straight from the source system (see the first sketch after this list). It is more reliable because the database enforces data typing and referential integrity, and it is faster because it avoids all the IO involved in writing the flat file and reading it back again. There is also the maintenance issue of adding or removing columns from flat files: typically both consumers and providers need to synchronise for such a change, whereas a database can accommodate it without consumers and providers having to synchronise their changes.
  • Measuring your Big Data project by the number of tera-, peta- or exabytes. This is an almost irrelevant figure, just as the amount of money a company spends on R&D won't tell you which one will come up with the next big idea. Big Data projects should be measured by the beneficial value of the information they provide. When specifying storage requirements, the amount of storage always comes after the performance specifications (MB/second and ms/IO, for example).
  • Engaging the services company with the best Big Data salesperson. Consultancy and software firms have discovered that Big Data projects can be very large and profitable. Consequently, their best salespeople are employed to win them, and they aim for the biggest projects. The big project is not always in the interest of the client, and the best salespeople are not always followed by the best developers.
  • Treating technology as your most important decision. The technology you choose is important, but it's not the overriding factor. As a corollary, only choose a technology when you need it. For example, if you are planning on taking 12 months to load the data, leave selection of the visualisation technology until after you have started loading. This will let you make a better choice of software (you will have some good sample data to trial it on) and the software market will be a few months more mature.
  • Purchasing production hardware before you need it. If your project isn't going live for 12 months, there is no need to purchase production hardware until close to that time. By then you will have a better idea of the specifications required, and the hardware will be a little more advanced and/or cheaper.
  • Taking Kimball or Inmon literally and making all dimension attributes Type II. Some naive Big Data developers worship Kimball and measure their technical prowess by how many Type II dimensions and attributes they have. So much so that they not only make your DW performance suffer, they also remove natural keys, making querying on current values impossible for end users. It's OK to have Type II dimensions, but only where they benefit the business users. Almost always, when you have Type II dimensions, you should also keep your business (natural) keys in the fact tables. For example, a transaction record might carry a surrogate key for Customer, which points to the customer record as it was at that time. If the Customer attributes change, later transactions will point to a different Customer record with a higher surrogate key. Now, when the business selects transactions for customers that belong to "Harry", they won't see transactions for customers that belong to Harry unless Harry was the customer owner at the time of the transaction. This can be very confusing and annoying to business users (see the second sketch after this list).
  • Reloading entire tables that only ever receive inserts from the source system. This one is quite obvious even to naive developers and consequently an uncommon pitfall.
  • Rejecting data from the business because it doesn't conform to a Big Data rule. For example, rejecting transactions because they have an invalid ItemId column. This is a kind of "holier than thou" issue: the Big Data custodians want to keep their warehouse clean, so they reject bad data. The problem is that the record was for a sale; it has an amount and is part of gross sales. By rejecting it, the warehouse gets an invalid view of the business. A valid approach is to include the transaction, albeit with an "unknown" sales item (see the third sketch after this list). It might be that at a later time the sales item table comes through with that missing ItemId, which the warehouse should accommodate without rejection or prejudice.
  • Forcing your business users to use one querying/visualisation technology. The Big Data project will have selected specific software for collection, loading, storing, cubing, data mining and so on. However, the business users should be free to select the technology of their preference for querying and visualising this information. Nowadays, structured data is well suited to OLAP cubes, and most warehouses exploit them. There are hundreds of software applications that can consume cubes, and the business users should be free to use the ones they prefer. It doesn't matter to the warehouse how many different cube browser technologies there are. It might matter how many concurrent queries there are, but that's a different issue.
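
On the first pitfall (flat files versus reading straight from the source), here is a minimal sketch of what a direct extract can look like, assuming the source is a relational system reachable from the warehouse server. The connection strings, table and column names are purely illustrative, and SQLAlchemy is just one convenient way to do it:

```python
# Minimal sketch: read rows straight from the source system's database
# rather than round-tripping through a flat file. All connection strings,
# tables and columns below are hypothetical placeholders.
import sqlalchemy as sa

source = sa.create_engine("mssql+pyodbc://sales_source_dsn")      # hypothetical source
warehouse = sa.create_engine("postgresql://warehouse_host/dw")    # hypothetical target

extract_sql = sa.text("""
    SELECT TransactionId, CustomerId, ItemId, Amount, TransactionDate
    FROM dbo.SalesTransaction
    WHERE TransactionDate >= :since
""")

insert_sql = sa.text("""
    INSERT INTO staging.sales_transaction
        (transaction_id, customer_id, item_id, amount, transaction_date)
    VALUES (:TransactionId, :CustomerId, :ItemId, :Amount, :TransactionDate)
""")

with source.connect() as src, warehouse.begin() as tgt:
    # The source database has already enforced data types and referential
    # integrity, so there is no CSV parsing or re-validation step here.
    rows = src.execute(extract_sql, {"since": "2013-03-01"}).mappings().all()
    if rows:
        tgt.execute(insert_sql, [dict(r) for r in rows])
```

Adding or dropping a column then becomes a change to one SELECT and one INSERT, rather than a file-format change that consumers and providers have to coordinate.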
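
On the Type II pitfall, this toy pandas illustration shows why keeping the natural key on the fact table matters. All the data, column names and values are made up for the example; the point is simply that the surrogate key gives the point-in-time view, while the natural key gives business users the current view:

```python
import pandas as pd

# Type II customer dimension: two versions of the same customer C042.
dim_customer = pd.DataFrame({
    "CustomerSK": [101, 102],           # surrogate key, one row per version
    "CustomerId": ["C042", "C042"],     # natural (business) key
    "Owner":      ["Sally", "Harry"],   # attribute tracked as Type II
    "IsCurrent":  [False, True],
})

# Fact table keeps the natural CustomerId alongside the surrogate key.
fact_sales = pd.DataFrame({
    "TransactionId": [1, 2],
    "CustomerSK":    [101, 102],        # version in force at the time of sale
    "CustomerId":    ["C042", "C042"],  # natural key retained on the fact row
    "Amount":        [50.0, 75.0],
})

# Point-in-time view: joining on the surrogate key shows transaction 1 under Sally.
as_at_sale = fact_sales.merge(dim_customer, on="CustomerSK", suffixes=("", "_dim"))

# Current view: joining the natural key to the *current* dimension rows shows
# both transactions under Harry, which is what users expect when they ask for
# "Harry's customers". Without CustomerId on the fact table this query is impossible.
current = fact_sales.merge(
    dim_customer[dim_customer["IsCurrent"]],
    on="CustomerId",
    suffixes=("", "_dim"),
)
print(current[["TransactionId", "Amount", "Owner"]])
```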
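
And on the rejection pitfall, a sketch of the "unknown member" approach, again with made-up data and names. Every sale is loaded, so gross sales still add up; rows whose ItemId hasn't arrived yet are parked on an explicit Unknown member and can be re-pointed later because the natural ItemId is kept on the fact row:

```python
import pandas as pd

UNKNOWN_ITEM_SK = -1   # explicit "Unknown" member in the item dimension

# Item dimension as it stands today (made-up rows).
dim_item = pd.DataFrame({
    "ItemSK": [1, 2],
    "ItemId": ["A100", "A200"],
})

# Incoming sales: A999 has not arrived in the item dimension yet.
incoming = pd.DataFrame({
    "TransactionId": [10, 11],
    "ItemId": ["A999", "A200"],
    "Amount": [35.0, 20.0],
})

# Load every row; unmatched ItemIds map to the Unknown member instead of being rejected.
loaded = incoming.merge(dim_item, on="ItemId", how="left")
loaded["ItemSK"] = loaded["ItemSK"].fillna(UNKNOWN_ITEM_SK).astype(int)

# Gross sales still total 55.0 -- nothing was thrown away.
print(loaded)

# When A999 eventually turns up in dim_item, the rows parked on the Unknown
# member can be re-pointed, because the natural ItemId was kept on the fact row.
```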
