Business Intelligence Integration

Joel Da Costa, Takudzwa Mabande, Richard Migwalla
Antoine Bagula, Joseph Balikuddembe

Project Description

Business Intelligence (BI) is the practice of using computer software to aid data analysis and decision making in businesses. It represents a set of processes, tools and technologies which improve the productivity, sales and service of an enterprise, and so its profitability in general. BI works primarily by collecting, organizing and analyzing corporate data and then creating useful knowledge out of this analysis (reporting). BI as a whole incorporates a wide spectrum of software functions including ad-hoc querying, on-line analytical processing (OLAP), dashboards, scorecards, search, visualization and more.

BI differentiates itself through its interdepartmental focus and general overview, which is geared towards total business performance. The implementation of BI gives knowledge and understanding to departmental groups which previously may not have had access to, or understanding of, the data. Increased analytics and ad hoc reporting allow organisations to better understand trends within their business and to apply a variety of different measures and attributes to understanding these trends. Once the BI system has been implemented, a company will typically find it has more ideas for new initiatives, more efficient and precise data collection processes, more effective marketing techniques, a better understanding of its customers' needs and characteristics, and a better understanding of the state of the market. This improved business agility and efficiency results in a long-term performance gain which can lead to significant profit increases.

The BI system itself is typically segmented into several key areas. The first is Business Modelling, which creates the framework of the system and establishes how information needs to flow.
Data warehouses are used as a centralized repository for all the data gathered, and are maintained through 'Extraction, Transformation and Loading' (ETL) processes. OLAP is a technique by which the data sourced from the data warehouse is visualized and summarized to provide a view across multiple dimensions, so that multi-dimensional queries can be answered quickly. Essentially, OLAP tells a business what has happened, while data mining explains why it happened and what is likely to happen in the future based on past patterns.

Problem Statement

The project will focus on the underlying technologies which enable Business Intelligence (BI) and on their application to two key scenarios. Previously, various technologies have been developed and implemented using a “one size fits all” approach, but this approach is likely to result in less effective and less accurate analysis. Different areas and analyses require adaptability rather than such a singular approach. Our aim then is to evaluate which technologies would be the most effective for the particular cases. The technologies being evaluated are Bayesian Belief Networks, Neural Networks and Artificial Immune Systems, which will be expanded on later in the proposal. The project will be done in cooperation with Sanlam, who will provide the necessary data to be analysed.

The first case is to analyse customer data in order to build customer profiles, so that customers may be targeted with the correct marketing techniques. Doing so would allow an increase in sales as well as a decrease in the cost of marketing. Using the data provided, the 3 different technologies will be applied to try to obtain the most accurate customer profiles.

The second case is 'Predictive Sales Forecasting'. Using historical and current data, the same 3 technologies will be applied to try to create an accurate forecast of future trends.
Forecasting allows better business decisions to be made and mitigating action to be taken in order to improve the likely outcome.

While the cases operate individually, both are being implemented with the same aim. What we want to ascertain from the project results is how the results of each approach vary when measured against the same data and benchmarked against the known sales figures. This will help define the strengths and weaknesses of the particular technologies in developing BI functions.

Procedures and Methods

This project is primarily designed as a research venture, with the main objective being the synthesis of usable research results, as per the Problem Statement. The scope does however extend beyond that, as shall be illustrated through the following breakdown of procedures and objectives. Implementation will take the form of a Java application. It will make use of 3 different intelligent systems to analyse historical data provided by Sanlam in order to predict the required output.

Rationale for chosen approach

Before considering the actual approach that will be used in addressing the Problem Statement, it is first necessary to mention the reasoning behind the choice of algorithms. The research conducted indicates that industry tends to favour these 3 algorithms, with a distinct preference for Bayesian Belief Networks. Furthermore, as per the initial meeting with the Sanlam representative, these 3 algorithms are of particular interest to Sanlam. More detail on industry's use of these algorithms can be found in the Related Work section.

The next step is to elaborate on the chosen approach, i.e. the choice to address the problem in the form of an application.
The following points summarize the motivation behind this:

- Application development allows further extensibility. By choosing to develop this project in the form of an application, there is more room to generalize and adapt it, making it useful in other spheres of business. Extensibility also allows room for improvement, so developing this application leaves room for continuation, evolution and progress.
- An application provides Sanlam with a concrete demonstration of the results obtained, as well as of how they were obtained.
- Recreating the results of this research experiment will be made easier given the platform of an application.

Development process

Because of the collaborative nature of this project, it is key that the primary stakeholder, i.e. Sanlam, submits a clear description of its requirements and expectations. For this reason, the project will involve a Sanlam delegate as well as the project team. Meetings will be held with the delegate in order to generate a specific set of user requirements from which the solution can be derived.

Once the requirements have been finalised, the next phase will be implemented. This will consist of developing the application, which will model abstractions of the selected intelligent systems. The application will take various forms of client information (provided by Sanlam) as input. This information will include elements such as incomes, premiums and purchasing history, to name a few. Based on this input, the application will then use the embedded intelligent systems to generate output, offering the user the choice as to which algorithm is applied in the simulation. This functionality will thus allow for comparison of results.

The output will be displayed in a format that is relevant to business users, and a graphical user interface will be implemented as part of the application.
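As an illustration only, the choice between algorithms could be modelled with a shared interface that each intelligent system implements. The sketch below is a minimal Java outline under stated assumptions: the names CustomerRecord, ProfilingTechnique, MajorityBaseline and TechniqueRegistry are hypothetical, the record fields are placeholders rather than Sanlam's actual schema, the values in main are made up, and the baseline merely stands in for the Bayesian Belief Network, Neural Network and Artificial Immune System implementations, which are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Small demonstration with made-up values (not Sanlam data). */
final class TechniqueDemo {
    public static void main(String[] args) {
        TechniqueRegistry registry = new TechniqueRegistry();
        registry.register(new MajorityBaseline());

        // The user picks a technique by name, e.g. from a drop-down list.
        ProfilingTechnique chosen = registry.select("Majority baseline");
        chosen.train(
                Arrays.asList(
                        new CustomerRecord(250000, 900, 3),
                        new CustomerRecord(90000, 300, 1),
                        new CustomerRecord(310000, 1200, 5)),
                Arrays.asList("high-value", "low-value", "high-value"));
        System.out.println(chosen.name() + " -> "
                + chosen.predict(new CustomerRecord(120000, 450, 2)));
    }
}

/** Hypothetical input record; the fields are placeholders, not Sanlam's schema. */
final class CustomerRecord {
    final double income;
    final double premium;
    final int pastPurchases;

    CustomerRecord(double income, double premium, int pastPurchases) {
        this.income = income;
        this.premium = premium;
        this.pastPurchases = pastPurchases;
    }
}

/** Assumed common contract for every intelligent system in the application. */
interface ProfilingTechnique {
    String name();
    void train(List<CustomerRecord> records, List<String> labels);
    String predict(CustomerRecord record);
}

/** Stand-in technique: always predicts the most frequent training label. */
final class MajorityBaseline implements ProfilingTechnique {
    private String majority = "unknown";

    @Override public String name() { return "Majority baseline"; }

    @Override public void train(List<CustomerRecord> records, List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        int best = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                majority = entry.getKey();
            }
        }
    }

    @Override public String predict(CustomerRecord record) { return majority; }
}

/** Holds the available techniques so a user interface can offer a choice. */
final class TechniqueRegistry {
    private final Map<String, ProfilingTechnique> techniques = new LinkedHashMap<>();

    void register(ProfilingTechnique technique) {
        techniques.put(technique.name(), technique);
    }

    ProfilingTechnique select(String name) {
        ProfilingTechnique technique = techniques.get(name);
        if (technique == null) {
            throw new IllegalArgumentException("Unknown technique: " + name);
        }
        return technique;
    }

    List<String> available() {
        return new ArrayList<>(techniques.keySet());
    }
}
```

With a structure of this kind, adding a further technique would only require a new class implementing the interface and one extra register call, which is one way the extensibility argued for above could be realised.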
By hiding a significant portion of the underlying technicalities and displaying only what is relevant to shareholders and other business analysts, the interface will achieve its purpose (more detail on this is provided later).

Ethical, Professional and Legal Issues

Ethical Issues

We will be using Sanlam sales and customer data, which is to remain confidential. It may not be redistributed to any external parties and no personal information may be extracted for use outside the project. For demonstration purposes, the software may not display personal information that may lead to the identification of particular individuals. This information may be used in the generation of results and forecasts, but it will be abstracted by using IDs in place of names if necessary.

Legal Issues

All sales and customer data from Sanlam must be kept private within the confines of the project. Any copies of the database must be deleted once testing is completed and may not be archived outside of Sanlam. No copies of the database may be created for use outside the project for any purpose.

Related Work

Customer Profiling

Sebastiani et al. used Bayesian networks to profile customers in order to predict profits. They used two networks: the first to describe the probability of response from customers, and the second to model price factors. The results were reasonable, and by capturing the characteristics of customers the models can potentially help to increase profits [1].

Similar work has been done by Elalfi et al., who combined neural networks with genetic algorithms. An algorithm was used to extract accurate and comprehensible rules from a database using trained artificial neural networks, which in turn were trained by genetic algorithms to find the optimal values for the model. These rules were then used to define customer profiles in order to make e-business more profitable [2].

Customer life cycles

Baesens et al.
introduce a measure of a customer's future spending evolution that might improve relationship marketing decision making. The suggested method predicts, from initial purchase information, whether a customer will increase or decrease spending. It achieved a 75% classification accuracy in predicting the customer lifecycle using purchase volume and purchase category [3].

Repeat Purchase Modelling

Baesens et al. focus on the need for companies such as mail-order companies to identify which customers are most likely to purchase before they send out costly catalogues. This involves profiling customers according to several parameters and calculating the probability of repurchase. A Bayesian neural network was used and achieved a correct classification rate of 71% on the data set used [4].

Modelling Customer Attitudes

Ishigaki et al. use Bayesian networks to model customer attitudes based on questionnaire data. The model can then be used to gauge customers' feelings towards a product, and how the product should be marketed to them. The model was fairly successful, with a 73.5% success rate in testing [5].

Sales Forecasting

Recently, Chang et al. built on the idea of sales forecasting by including clustering in the model. The K-means technique is used to cluster the data, which is then fed to a fuzzy neural network that, once trained, can generate sales forecasts. The model proved very effective in providing accurate forecasts, and was more accurate than a series of other models it was tested against [6].

Anticipated Outcomes

We will create a package that will read in data from the Sanlam database, use different machine learning techniques to profile customers, and compare the accuracy of the different techniques using actual data.

System

The software will be composed of:
An interface to the database that will read in relevant data.
The core of the program, which will contain three different intelligent system techniques that a user can utilize.
The front-end interface, which will give the user the results of the classification, comparing actual data to inferred information.

The major component will be the implementation of the different techniques. However, it will still be important to have good interfaces with the database and the user. The user interface will need to display interpretable information on the performance of each technique, which will entail aggregating the results in a way that a user can quickly and easily understand. It will also need to allow parameters to be changed so that each technique can be optimised for particular data sets.

Expected Impact

We expect to identify the best machine learning technique to use for customer profiling and sales forecasting for Sanlam in particular. From our initial investigation it seems that Bayesian networks are very good classifiers (useful in customer profiling) and neural networks are very good forecasters. The performance of each technique, however, is highly dependent on the task, the data and the results required. This may mean that the performance results in Sanlam's case will not necessarily match the results for other organisations or companies.

Key Success Factors

The results of the simulations will need to be compared to existing data for what the simulations are trying to predict. The comparisons will be used to rank each technique according to the accuracy of its results. All simulations will be expected to complete within an acceptable time frame. Performance and scalability are out of scope for this project, but each implementation will need to run within a determined acceptable time, which makes performance negligible in determining the best technique to use.
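The comparison and ranking described under Key Success Factors could be sketched as follows. This is a minimal Java illustration under stated assumptions, not the project's actual evaluation code: the class AccuracyRanking and its method names are hypothetical, classification accuracy is assumed as the measure for customer profiling, mean absolute error is assumed as one possible measure for sales forecasts, and the labels in main are toy values rather than results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AccuracyRanking {

    /** Fraction of predictions that match the known outcomes. */
    static double accuracy(List<String> predicted, List<String> actual) {
        if (actual.isEmpty() || predicted.size() != actual.size()) {
            throw new IllegalArgumentException("Need equally sized, non-empty lists");
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (predicted.get(i).equals(actual.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    /** Mean absolute error between forecast values and known sales figures. */
    static double meanAbsoluteError(double[] forecast, double[] actual) {
        if (actual.length == 0 || forecast.length != actual.length) {
            throw new IllegalArgumentException("Need equally sized, non-empty arrays");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(forecast[i] - actual[i]);
        }
        return total / actual.length;
    }

    /** Technique names ordered from most to least accurate. */
    static List<String> rank(Map<String, List<String>> predictions, List<String> actual) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : predictions.entrySet()) {
            scores.put(entry.getKey(), accuracy(entry.getValue(), actual));
        }
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort(Comparator.comparingDouble((String name) -> scores.get(name)).reversed());
        return names;
    }

    public static void main(String[] args) {
        // Toy labels only; these are not project results.
        List<String> actual = Arrays.asList("A", "B", "B", "A");
        Map<String, List<String>> predictions = new LinkedHashMap<>();
        predictions.put("Technique 1", Arrays.asList("A", "B", "A", "A"));
        predictions.put("Technique 2", Arrays.asList("A", "B", "B", "A"));
        predictions.put("Technique 3", Arrays.asList("B", "B", "A", "B"));
        System.out.println(rank(predictions, actual));
    }
}
```

Because every technique is scored against the same known outcomes, the ranking stays comparable across techniques; a different measure could be substituted without changing the overall structure.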
Project Plan

Risk Management

The risks that follow are to be evaluated based on the following risk matrix:

                    Probability
Impact          Low     Medium     High
Disastrous      C       B          A
Serious         D       C          B
Marginal        E       D          C
Trivial         F       E          D

The following table gives a breakdown of the predicted risks associated with this project, paying special attention to their impact and probability. It also highlights 2 courses of action: avoidance, which is an on-going process, as well as mitigation should the risk materialize.

1. Risk: Loss of a project team member. (This would occur if one or more members abandoned the Honours Programme for any number of reasons.)
   Matrix Evaluation: D. Serious / Low Probability
   Avoidance: Pressure to stay on the project, as failure to do so means not graduating.
   Mitigation: Have sufficiently independent deliverable modules for each team member.

2. Risk: Delay in delivery of test data. (Dependent on Sanlam for data; external factor.)
   Matrix Evaluation: C. Disastrous / Low Probability
   Avoidance: Pressure Sanlam to provide data as soon as possible.
   Mitigation: Create random test data or use alternative available data.

3. Risk: Scope creep. (Plan too many tasks; cannot complete tasks in time.)
   Matrix Evaluation: E. Marginal / Low Probability
   Avoidance: Project planned in detail with supervisor and department approval.
   Mitigation: Start with fundamental features first and leave other things to the end.

4. Risk: Data loss due to hardware failure. (External factor.)
   Matrix Evaluation: C. Serious / Medium Probability
   Avoidance: Frequent backups of all progress on different machines or storage devices.
   Mitigation: Roll back to last backup.

5. Risk: Missing project deadlines.
   Matrix Evaluation: C. Serious / Medium Probability
   Avoidance: Constant reference to the project timeline and clear communication between project members.
   Mitigation: Review and reassess deadlines, readjusting where necessary, as cost-effectively as possible.

6. Risk: Misunderstanding user requirements. (Resultant of miscommunication or ambiguity in user-team interaction.)
   Matrix Evaluation: D. Serious / Low Probability
   Avoidance: Constant communication with Sanlam to maintain correct direction. Also, providing Sanlam with the project plan and design in order to detect flaws.
   Mitigation: Iterations through development so that inconsistencies can be detected early.

Timeline & Gantt Chart

Resources Required

The resources required to complete the project are fairly standard, with the software and equipment in the Honours Lab sufficing for development. Apart from this though, Joseph Balikuddembe is necessary as a representative of Sanlam and as co-supervisor for the project. Furthermore, the data regarding customers and sales that Sanlam will provide is crucial to the project development.

Necessary Resources:
PCs
Sanlam Database Access
Java Development Platform

Deliverables

The following table illustrates a detailed list of the deliverables necessary for the completion of this project:

Deliverable                       Description
Final Project Proposal            Final copy of Proposal for evaluation.
Project Proposal Presentation     Presentation to supervisor and class.
Project Web Presence              Online availability of proposal and project timeline.
Project Poster                    Poster representation of Project.
Project Web Page                  Open Availability of Project Webpage.
Project Report                    A report on the results of the research.
Project Application               The actual project.

Further detail as to dates can be seen as per the Milestones.
Milestones

Milestones                                  Dates
Literature Synthesis                        Mon 3-May
Project Proposal                            Wed 12-May
Project Proposal Presentations              Mon 17-May
Finalized Project Proposal                  Mon 31-May
Project Web Presence                        Tue 1-Jun
Prototype                                   Fri 4-Jun
Background/Theory Chapter                   Fri 22-Jun
Design Chapter                              Mon 6-Jul
Database interface setup                    Tue 20-Jun
Customer Profiling BI Techniques            Tue 24-Aug
Sales Forecasting BI Techniques             Tue 14-Sep
Visualization/GUI                           Fri 17-Sep
First Implementation                        Mon 20-Sep
Final Prototype                             Wed 29-Sep
Chapters on Implementation and Testing      Mon 4-Oct
Outline of complete report                  Mon 11-Oct
Final Complete Draft of Report              Mon 25-Oct
Poster                                      Thu 04-Nov
Web Page                                    Mon 8-Nov
Reflection Paper                            Fri 12-Nov
Project Demonstrations                      Wed 03-Nov
Final Project Presentations                 Thu 18-Nov

Work Allocation

Joel Da Costa will implement the Bayesian Belief Networks algorithm for the two cases. Additionally, he will handle the implementation needed to connect to, or draw data from, the Sanlam database.

Takudzwa Mabande will implement the Neural Networks algorithm for the two cases. Additionally, he will handle the usage of data for the Sales Forecasting case, as well as the output visualization for both cases.

Richard Migwalla will implement the Artificial Immune Systems algorithm for the two cases. Additionally, he will handle the usage of data for the Customer Profiling case, as well as the general GUI implementation.

References

[1] Sebastiani, P., Ramoni, M. and Crea, A. Profiling your Customers using Bayesian Networks. SIGKDD Explorations 1(2), 91-97.
[2] Elalfi, A., Haque, R. and Elalami, M. Extracting rules from trained neural network using GA for managing E-business. Applied Soft Computing 4, 65-77.
[3] Baesens, B., Verstraeten, G., Van Den Poel, D., Egmont-Petersen, M., Van Kenhove, P. and Vanthienen, J. 2004. Bayesian network classifiers for identifying the slope of the customer lifecycle of long-life customers. European Journal of Operational Research 156, 508-523.
[4] Baesens, B., Viaene, S., Van Den Poel, D., Vanthienen, J. and Dedene, G. 2002. Bayesian neural network learning for repeat purchase modelling in direct marketing. European Journal of Operational Research 138, 191-211.
[5] Ishigaki, T., Motomura, Y., Dohi, M., Kouchi, M. and Mochimaru, M. Knowledge Extraction by Probabilistic Cognitive Structure Modeling Using a Bayesian Network for Use by a Retail Service. MEDES, October 2009, 141-149.
[6] Chang, P., Liu, C. and Fan, C. Data clustering and fuzzy neural network for sales forecasting: A case study in printed circuit board industry. Knowledge-Based Systems 22, 344-355.