Yesterday I attended a meetup organised jointly by the OR Society and Dell-Statistica on the use of predictive analytics to improve patient care. The topic is very interesting and the discussions were lively. But the big news for me was discovering that Dell has purchased StatSoft and is now promoting Statistica. I hope they do not repeat the mistakes IBM made (and is still making) when it took over SPSS. For starters, Statistica is better than SPSS on several levels. One of them is that, like SAS, it has a strong data management capability of its own. To do real work with SPSS you need to link it to some overpriced IBM product. Last time I looked you also got more bang for your buck than with SPSS, i.e. more functionality. I am refraining from a full comparison with SAS, as I believe that in Europe it is irrelevant because of the SAS pricing model. Even if SAS is better by far than any other solution, most organisations in Europe, and in the Far East for that matter, will struggle to build a business case for the expenditure. If you can afford it, SAS is still my first choice. However, R and Statistica are close behind. I will be watching Dell to see how they position the software and analytics services, crossing my fingers that they quickly find a way to turn it into a cohesive offering (like SAS) rather than the hodgepodge of mixed messages one gets from IBM.
Friday, 13 December 2013
A couple of days ago I attended a SAS Professionals (http://www.sasprofessionals.net/) event focusing on SAS V9.4, which is due to be launched in Europe in January 2014 (with statistics 12.3, and 13.1 to follow towards the end of the year). As usual there is so much new terminology to learn and so many new paradigms to get one's head around. Naturally I concentrated on what really interests me: analytics. But there are some non-analytics things that might interest analysts such as myself:
- SAS has significantly hardened the security.
- There are a few new ODS destinations aimed at the mobile device world. But the one that is, to me, the game changer is ODS to MS PowerPoint, which completes the suite of preferred delivery platforms. Let me spell this out: a good SAS programmer can automatically create sleek PDFs, Excel workbooks and PowerPoint decks. Now it can all be zipped automatically with an ODS option as well.
- SAS has introduced two new scripting languages: FedSQL and DS2. The latter, DS2, is something every self-respecting SAS programmer should know. It harks back to the object-oriented SCL of SAS/AF (oh, the good old days), so SAS dinosaurs like myself will feel at home. The power, according to the presenters, is the ability to truly harness parallel programming and to write code objects that are truly portable to other environments. We are currently facing a case where we could have benefited from the latter feature: we created an amazing solution and now the client wants the beautiful data steps dumbed down to SQL. In the new world we could just hand over the DS2 and it would work as is in, say, Oracle.
- The IT-oriented people will be thrilled with the new embedded web server (saving some licensing money there) and the shiny new SAS Environment Manager.
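To make the ODS point concrete, here is a minimal sketch of the idea rather than a tested recipe. The output folder, file names and package name are made up for illustration, sashelp.class is simply the sample table that ships with SAS, and using ODS PACKAGE for the zipping step is an assumption about which option the presenters had in mind.

```sas
/* Hypothetical output folder: change to suit your environment */
%let outdir = C:\temp;

/* Send procedure output to a PowerPoint deck (ODS destination new in 9.4) */
ods powerpoint file="&outdir.\class_summary.pptx";
title "Height and weight by sex";
proc means data=sashelp.class mean min max maxdec=1;
  class sex;
  var height weight;
run;
ods powerpoint close;

/* Bundle the deck into a zip archive with ODS PACKAGE            */
/* (assumed here to be the zipping route; package name is made up) */
ods package(deliverables) open nopf;
ods package(deliverables) add file="&outdir.\class_summary.pptx";
ods package(deliverables) publish archive
    properties(archive_name="deliverables.zip" archive_path="&outdir.\");
ods package(deliverables) close;
```

The same pattern should extend to PDF and Excel output by opening the corresponding ODS destination and adding each file to the package before publishing.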
On the analytics side the most interesting development I noted was the High Performance procedures. They are designed for MPP environments, doing true parallel in-memory processing. They come in bundles focusing on statistics, econometrics, optimisation, data mining, text mining and forecasting. It seems that the re-written engines also perform significantly better on SMP environments (you know, the PCs and servers we are all using). In essence the technology uses the hardware better than ever, as long as you have more than one core and enough memory assigned to you. A small but useful set of HPxxx procs will be included in Base SAS if one licenses other statistically oriented packages (STAT, OR, ETS, Miner …). It would be interesting to stress test them on an SMP environment and figure out the optimal settings.
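A stress test of that kind could start from something like the sketch below. It assumes PROC HPREG is available on the installation (it comes with the high-performance procedures in SAS/STAT 12.3); the table work.big, its variables and the macro name hp_stress are invented for illustration, and the thread counts are arbitrary.

```sas
/* Simulated data: table and variable names are hypothetical */
data work.big;
  call streaminit(94);
  do i = 1 to 2000000;
    x1 = rand('normal');
    x2 = rand('normal');
    x3 = rand('uniform');
    y  = 1 + 2*x1 - x2 + 0.5*x3 + rand('normal');
    output;
  end;
  drop i;
run;

/* Fit the same model with a given number of threads.        */
/* DETAILS asks the procedure to print its timing breakdown. */
%macro hp_stress(threads);
  title "HPREG with &threads thread(s)";
  proc hpreg data=work.big;
    model y = x1 x2 x3;
    performance nthreads=&threads details;
  run;
%mend hp_stress;

%hp_stress(1)
%hp_stress(2)
%hp_stress(4)
%hp_stress(8)
title;
```

Comparing the timing tables across runs, and repeating with larger row counts, gives a rough idea of where extra threads stop paying off on a particular machine.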
It seems to me that most of the new features discussed for EM 12.3 are features that were there in versions 2.0 to 4.0 but disappeared in the move to the thin client in 5.0, such as enhanced control over decision trees. A new and interesting addition is survival data mining with time-varying covariates.
I will definitely have to look deeper into:
- SAS Contextual Analysis
- Recommendation engine
One interesting observation is that not many people chose to go to the analytics session; most went to the BI and Enterprise Guide ones. Am I one of a dying breed? Or is it that all the SAS statistical programmers are so busy they do not have time to come to events such as this?
Tuesday, 19 March 2013
Is there a business case for underpinning strategic human capital planning with advanced numerical analytics?
Too many managers hasten to answer the question in the title with a "no" before fully understanding the terminology. Evidence-based decision making will never replace good old intuition, gut feeling or back-of-a-fag-packet decisions. To get these right you have to be brilliant and lucky. Even if you are, you have taken care of the high level but not of the details. An experienced architect will be able to tell you immediately, during a site visit, that there are several ways to build a bridge and propose an off-the-cuff strategy (say, a steel suspension bridge). Even if we do not explore other options for building the bridge, we cannot (and should not) proceed without detailed plans and costings. But that is exactly what happens again and again when companies make decisions about their most important resource: their people.
Most managers associate strategic human capital planning with figuring out how many people are required to perform a task. For instance, how many level 2 engineers are required to handle the expected peak demand for boiler repair call-outs? This could be refined by engagement type and cost. Although this addresses the immediate need and ensures a good service level, the long-term effects are not considered. For instance, the future burden on the pension pot, the expected strain on the training centre due to high employee churn, and career funnel bottlenecks should all be evaluated and quantified. And here lies the business case: putting your finger on the long-term (usually hidden) costs that could be avoided.
A good strategic plan needs a representative sandbox. Analytical tools such as predictive modelling (what will the demand be?), simulation (this is how it works today) and optimisation (what options should I consider?) should be used in highly complex situations where the impact of a decision is multifaceted. For example, it is straightforward to expect that restricting an aeroplane mechanic to one hangar will result in lower utilisation rates. However, the impact on the number of pilots required, because they have to fill in for colleagues who are waiting for planes to be serviced, is not linear and is co-dependent on several other levers that could be set at different levels.
Taking time out from the whirlwind of fire-fighting to look at the bigger picture is imperative for the business’s long-term health.
Sunday, 10 February 2013
I could not agree more with Thomas C. Redman’s post “What Separates a Good Data Scientist from a Great One” (Harvard Business Review, January 28, 2013). I would like to suggest that sometimes it is not just down to the traits of the person doing the job. There is also an element of company culture and environment. It got me thinking about my past experience, where the same people did great work in one setting and just work in another. You can employ the best data scientist in the world; but are you allowing her to be one?
Redman discussed four traits: a sense of wonder, a certain quantitative knack, persistence, and technical skills. Some of the commenters suggested that business acumen, courage, mathematics and programming should be added to this list. Interestingly, attention to detail was not directly mentioned. So what is an environment that is conducive to great data science work?
Good data scientists are allowed to become great when the people they work with and for understand the importance of this type of investigation and realise it is an R&D approach. I have seen many situations where the data scientist was working in a ‘consulting firm’ role, i.e. the role was defined as providing a service to the business units. This, in itself, is a very good model, which I like very much as it ensures a deep understanding of the business and cross-fertilisation of ideas. The difference between good and great is in the way work is prioritised and the time allocated for its completion. At one end of the scale, the data scientists are only allowed to respond to work requests, sticking to the defined scope. This will reduce the best data scientist to a BI programmer; and trust me, it is very easy to fall into this path-of-least-challenge attitude. Everybody is happy but the point is lost.
At the other end of the scale we have the ‘please do not bother us with trivia’ strategic thinkers, who work in an academic mode on work that comes only from C-level managers and are given milestones that are months apart. To be able to pull that off one needs to be a really super data scientist working with a dream team of C-level managers. Too often I have seen these teams losing the plot for lack of tension and focus.
As always, the correct balance has to be struck. I worked in such an environment, where we mainly provided straightforward analytics (and yes, BI) to the business units but were also given space to suggest investigations of our own. The culture was one of ‘go ahead and try it; if it does not work we have still learnt something’. More importantly, the top managers made a healthy distinction between a simple delivery of the findings and a simple approach (what I call the sat-nav model, where the device provides a simple interface to a very complex solution). The atmosphere changed when a reorganisation brought in new management that did not see the value of doing more than one was asked for, or of spending time investigating alternative analytical approaches. I think they have now reverted to the stone age of doing forecasts using univariate linear regressions in Excel.
To pull one’s team away from either end of this continuum, the manager of data scientists must be persistent, as suggested in the original post, but also a good communicator who can build trust in the quality and importance of the analysis.
Wednesday, 30 January 2013
I just completed the 2013 Rexer Analytics data mining users survey. I make it a point to do my best to complete these surveys as they usually make me stop and think. This year there were two questions that were very relevant.
I am just finishing off a nice project for an international retailer that brought me in to run the process of choosing a Marketing Intelligence Platform (in plain English: to choose data mining software and figure out how it should be deployed). One of the most interesting challenges of writing the RFI and RFP was agreeing with the business what was important to them. I found it pleasing that most of the points I put forward for discussion were listed in one or two of the questions in the survey. I will hold off voicing my very strong opinions until the survey results are published.
I am also very curious to see the results of the survey with regard to software preference and how the response has varied over time (this is one of the constant questions). During the process of engaging with the software providers and researching the web I have come to realise how much this arena has changed in just the last few years. I believe that the abundance of solutions might, to an extent, lead to software selection paralysis. It is important not to drop the ball: remember why your organisation is looking into data mining and clearly define what you expect the software to deliver (back to the original question above).
Do your bit and complete this survey (www.RexerAnalytics.com/Data-Miner-Survey-2013-Intro.html) – let's see how the responses pan out.
Tuesday, 22 May 2012
A client that is preparing to roll out Windows 7 64-bit to all its employees asked me to ensure that the SAS functionality is not lost. Currently they use V9.2 on XP in the good old-fashioned way – disparate PC installations of Base/STAT/GRAPH/ETS ….
Repeating the same installation script used for XP 32-bit, I experienced only one difference: access to Office files. This applies to the versions of Office that use the four-letter extension (e.g. xlsx instead of xls). The fix is to add the ‘PC Files Server’ to the installation script and modify the relevant proc import and libname statements.
The statement in Win XP (32)
proc import file="<filename>.xlsx" out=<dataset> dbms=excel; run;
should be modified in Win7 64 to specify a different dbms
proc import file="<filename>.xlsx" out=<dataset> dbms=excelcs; run;
Pointing at an MS Access database in Win XP (32)
Libname mylib "<filename>.accdb";
should be modified in Win7 64 to specify the pcfiles engine
Libname mylib pcfiles path="<filename>.accdb";
While exploring the installation issues I had interesting chats with the IT people responsible for purchasing SAS and packaging it for enterprise-wide installation. I could fill pages and pages discussing their thoughts, pains and complaints. It boils down to poor documentation (EXPLANATION) of the installation decisions that need to be made, and the PRICE. I had to scrape the person off the floor after he got the quote from SAS for a server (no frilly stuff like BI or EG).
Thursday, 19 April 2012
Yesterday I attended “Multiple Imputation for missing data: State of the art and new developments”. It definitely lived up to the title. The presenters (James Carpenter, Jonathan Bartlett, Rachael Hughes, Ofer Harel and Shaun Seaman) described the latest developments in this field in a manner I could easily follow. I am now very interested in trying out Chain Imputation, Fully Conditional Specification (FCS) and the combination of Inverse Probability Weighting with Chain Imputation. The latter makes a lot of sense to me, as it provides a two-stage approach to imputation where the first stage deals with missing records (completely or mostly missing, in my language) and the second with partially missing records.
The discussion really brought home to me the importance of understanding the mechanism of ‘missingness’. Yes, we all learnt that at university, but it does no harm to be reminded. It is not just a matter of mastering the technology to get it to run the imputation (SAS has a node in EM and proc MI) but also of really understanding what you are doing. That is achieved by talking to the people who gathered the information and investigating the reasons for the missing information and the assumptions that could (should) be made.
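For readers who want to try FCS in proc MI, the following is a minimal sketch and not something presented at the event. It assumes SAS/STAT 9.3 or later (where the FCS statement is available) and a hypothetical table work.patients in which age and sbp are complete while bmi (continuous) and smoker (binary) have gaps; the number of imputations and the seed are arbitrary.

```sas
/* Hypothetical input: work.patients, with missing values in bmi and smoker */
proc mi data=work.patients nimpute=20 seed=20120419 out=work.patients_mi;
  class smoker;
  /* One conditional model per incomplete variable */
  fcs reg(bmi) logistic(smoker);
  /* Include the analysis outcome (sbp) in the imputation model */
  var age sbp bmi smoker;
run;

/* Fit the analysis model separately within each completed data set */
proc reg data=work.patients_mi outest=work.est covout noprint;
  model sbp = age bmi;
  by _imputation_;
run;

/* Pool the estimates and standard errors using Rubin's rules */
proc mianalyze data=work.est;
  modeleffects Intercept age bmi;
run;
```

None of this replaces the conversation with the people who collected the data: the choice of variables in the VAR statement is exactly where the assumptions about the missingness mechanism enter.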
One of the key questions asked by the audience was whether there is a measure or a methodology to indicate how useful the imputation was and whether it was required in the first place. You guessed it: there is not. The key consideration is not missing data but missing information (Ofer Harel had an interesting approach to get closer to this). For example, if the missing information is missing completely at random and the full records contain all the information about the correlations, then there is no need to impute. Using the percentage of missing data is not indicative either: for example, when analysing a rare event the 0.5% of missing observations might be just those that hold the key to understanding.
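A small simulation illustrates why the percentage missing says so little. This is a sketch with made-up data, not an example from the talks: the same 30% of y is removed twice, once completely at random and once only among records with a high x, and the complete-case means are compared with the truth.

```sas
/* Simulated data: all names and parameters are illustrative */
data work.sim;
  call streaminit(2012);
  do id = 1 to 100000;
    x = rand('normal');            /* always observed            */
    y = 2*x + rand('normal');      /* true mean of y is zero     */

    /* Missing completely at random: 30% of y removed at random */
    y_mcar = y;
    if rand('uniform') < 0.30 then y_mcar = .;

    /* Missingness depends on x: 60% removed, but only where x > 0, */
    /* which is again about 30% of all records                      */
    y_dep = y;
    if x > 0 and rand('uniform') < 0.60 then y_dep = .;

    output;
  end;
run;

/* Compare the complete-case means against the full-data mean */
proc means data=work.sim n nmiss mean maxdec=3;
  var y y_mcar y_dep;
run;
```

Both incomplete columns should show roughly the same share of missing values, yet the mean of y_mcar stays close to that of y while the mean of y_dep is pulled well below it, because the high-x records are under-represented. The same missing percentage can therefore mean very different amounts of missing information.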