Tuesday 22 May 2012

Migrating sas from XP(32) to Win7(64)

A client preparing to roll out Windows 7 64-bit to all its employees asked me to ensure that no sas functionality is lost. Currently they use V9.2 on XP in the good old-fashioned way – disparate PC installations of base/stat/graph/ets ….
Repeating the same installation script used for XP 32-bit, I have experienced only one difference – access to Office files. This applies to the versions of Office that use the four-letter extension (e.g. xlsx instead of xls). The fix is to add the ‘pc files server’ to the installation script and modify the relevant proc import and libname statements.
The statement in Win XP (32)
proc import datafile="<filename>.xlsx" out=<dataset> dbms=excel; run;
should be modified in Win7 64 to specify a different dbms:
proc import datafile="<filename>.xlsx" out=<dataset> dbms=EXCELCS; run;

Pointing at an MS Access database in Win XP (32)
Libname mylib "<filename>.accdb";

should be modified in Win7 64 to specify the pcfiles engine

Libname mylib pcfiles path="<filename>.accdb";
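The same route applies when writing back to Excel from the 64-bit client – proc export through the pc files server. A sketch only (the data set and file names here are made up):

proc export data=work.results
            outfile="C:\reports\results.xlsx"
            dbms=excelcs replace;   /* routed via the pc files server */
run;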

Exploring the issues in the installation, I had interesting chats with the IT people responsible for purchasing sas and packaging it for enterprise-wide installation. I could fill pages and pages discussing their thoughts, pains and complaints. It boils down to poor documentation (EXPLANATION) of the installation decisions that need to be made and the PRICE. I had to scrape the person off the floor after he got the quote from sas for a server (no frilly stuff like BI or EG).

Thursday 19 April 2012

Multiple Imputation for missing data: State of the art and new developments

Yesterday I attended the “Multiple Imputation for missing data: State of the art and new developments” event. It definitely lived up to the title. The presenters (James Carpenter, Jonathan Bartlett, Rachael Hughes, Ofer Harel and Shaun Seaman) described, in a manner I could easily follow, the latest developments in this field. I am now very interested in trying out Chained Imputation, Full Conditional Specification (FCS) and the combination of Inverse Probability Weighting with Chained Imputation. The latter makes a lot of sense to me as it provides a two-stage approach to the imputations, where the first stage deals with records that are missing completely or mostly (my language) and the second with partially missing records.

The discussion really brought home to me the importance of understanding the mechanism of ‘missingness’. Yes, we all learnt that at university, but it does no harm to be reminded. It is not just about mastering the technology to run the imputation (sas has a node in EM and proc MI) but also about really, really understanding what you are doing. That is achieved by talking to the people who gathered the information and investigating the reasons for the missing information and the assumptions that could (should) be made.
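For my own reference, the FCS route in proc MI looks roughly like this (a sketch only – it assumes SAS/STAT 9.3 or later, where the FCS statement is available, and the data set and variable names are invented):

proc mi data=survey out=survey_mi nimpute=5 seed=20120419;
   class smoker;                              /* categorical variable */
   fcs logistic(smoker) regpmm(income age);   /* imputation method per variable */
   var smoker income age bmi;                 /* bmi gets the default method */
run;

The completed data sets in survey_mi would then be analysed and the results combined with proc mianalyze in the usual way.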

One of the key questions asked by the audience was whether there was a measure or a methodology to indicate how useful the imputation was and whether it was required in the first place. You guessed it – there is not. The key consideration is not missing data but missing information (Ofer Harel had an interesting approach to getting closer to this). For example, if the data are missing completely at random and the complete records contain all the information about the correlations, then there is no need to impute. The percentage of missing data is not indicative either: when analysing a rare event, the 0.5% of missing observations might be just those that hold the key to understanding.

Friday 24 February 2012

What is an insight team all about?

How to capture succinctly the uniqueness of an insight team? How to describe, in an ‘elevator ride’, the key benefits of having such a team? In my mind, it is first important to make a clear distinction between Business Intelligence (BI), which is about disseminating meaningful data to the people who need it, and Analytical Intelligence (AI). A good BI is essential for successful AI as it frees the team from the all-consuming bush-fire fighting.

I like the title of an Experian document, “Analytical insight: bringing science to the art of marketing”. One of the section headings, “Turn data into intelligence”, makes a good stab at the question.

How about “A team of knowledge workers turning data into intelligence, insight and action using advanced tools and techniques”:

• Knowledge workers = highly qualified and experienced
• Turning data = Evidence Based
• Intelligence = that backwards mirror – BI
• Insight = statistical analysis (forecasting, basket analysis, churn analysis, channel optimisation, etc.)
• Action = informing strategic decisions (e.g. how many sales people do we need) and driving tactical activities (where should we place them and whom should they meet)
• Advanced tools = software such as sas, JMP, MapInfo – definitely not toys like Excel
• Advanced techniques = more than just means and guesses – NLMIXED, time series analysis, neural networks, factor analysis, etc.

Another of my attempts is “Increasing revenue and reducing costs through evidence-based analytics”.

Wednesday 8 February 2012

Using Geo-Spatial Awareness to Get That Extra Edge Out of Predictive Analytics

The way to get that extra edge out of the analysis is to get your hands on the key drivers, transform them wisely and exploit the correlations. The data mining tools are very good at the first steps for most types of data. However, two main gaps are still awaiting a proper answer: temporal correlation and spatial correlation. An experienced statistician can handle these gaps by clever data manipulation and by returning to the good old sas-stat & sas-ets to use advanced modelling approaches such as nlmixed and arima.
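On the temporal side, the kind of thing I mean is a plain sas-ets call along these lines (a sketch; the data set and variable names are invented):

proc arima data=sales_ts;
   identify var=sales(1) nlag=24;   /* difference once, inspect ACF/PACF */
   estimate p=1 q=1 method=ml;      /* fit an ARIMA(1,1,1) */
   forecast lead=12 out=fcst;       /* forecast 12 periods ahead */
quit;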

However, it is important to be able to clean and transform spatial information such as the locations of practices a sales rep has visited, the geo-demographic profile of the practice catchment, the regulatory environment for the practice, or the influence of the nearest hospitals and the specialists working in them. Sas has very elementary tools to handle mappable information, such as kriging, point-in-polygon and map rendering procedures. However, it feels like sas did not push the development of this aspect of analytics very hard, especially after the agreement with ESRI [http://www.esri.com/] (the sas-bridge to ESRI - http://www.sas.com/products/bridgeforesri/). I found an announcement from 2002 - http://www.esri.com/news/arcnews/winter0203articles/sas-and-esri.html. I got to try out the bridge around 2004 and was bitterly disappointed as it was very clunky and did not really allow for a properly seamless feel. At the time I also experimented with sending queries to MS SQL Server (augmented with the spatial analysis pack) and with writing MapBasic code on the fly within a sas session, compiling it and calling MapInfo to execute it using data exported from sas to csv (Ha!). The latter is my current preferred mode of work but it has obvious shortcomings. The unexpected one is that one cannot automate drive time calculations in MapInfo, and boy do I need to do this now.
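To illustrate the point-in-polygon piece, the sas route is roughly this (a sketch; the points data set, map data set and id variable are all made up – the points must carry x/y in the same coordinate system as the map):

proc ginside data=practice_locs map=region_map out=practices_located;
   id region_id;   /* polygon identifier attached to each matched point */
run;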

Blair Freebairn of GeoLytix (http://geolytix.co.uk/) stopped over a few days ago and we had an interesting discussion exploring the need for dynamic interaction between an analytical package such as sas and GIS software such as ArcView (ESRI). Many of the applications we thought up really need only once-in-a-while processing, such as identifying drive time catchments, joining in Mosaic (geo-demographic - http://www.experian.co.uk/business-strategies/mosaic-uk-2009.html) and aggregating up using appropriate weights. That could be done once a quarter and presented to sas as a csv to augment any analysis mart. Or fitting a predictive model once a week and implementing the real-time scoring in the GIS software. However, I can envision a situation where data should go back and forth seamlessly to effectively use the strengths of sas and a GIS platform – not just for reporting purposes. Please feel free to share your experience and thoughts in the comments stream.
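For the quarterly hand-off, the sas end need be nothing fancier than this (a sketch; the file path, data set names and join key are invented):

proc import datafile="C:\gis\catchments_q1.csv"
            out=catchments dbms=csv replace;
   guessingrows=500;   /* scan more rows before deciding column types */
run;

proc sql;
   create table mart_plus as
   select m.*, c.drivetime_mins, c.mosaic_group
   from analysis_mart as m
        left join catchments as c
        on m.practice_id = c.practice_id;
quit;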

I hear there is a new version of the Bridge to ESRI – anybody out there experienced it?