Front cover Patterns: Portal Searchh Custom Design gn Applying the Information Aggregation patterns to portal search solutions Hints/tips for using IBM search technologies A portal search scenario William Tworek Christopher Desforges Robert Bell Raghu Krishnaswamy ibm.com/redbooks International Technical Support Organization Patterns: Portal Search Custom Design April 2004 SG24-6881-00 Note: Before using this information and the product it supports, read the information in “Notices” on page ix. First Edition (April 2004) © Copyright International Business Machines Corporation 2004. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi The team that wrote this redbook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Part 1. Introductory material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Chapter 1. Patterns for e-business introduction . . . . . . . . . . . . . . . . . . . . . 3 1.1 The IT architect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 The Patterns for e-business layered asset model . . . . . . . . . . . . . . . . . . . . 4 1.3 How to use the Patterns for e-business . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.1 Select a Business, Integration, or Composite pattern, or a Custom design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.2 Select Application patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.3 Review Runtime patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.4 Review Product mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3.5 Review guidelines and related links . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Chapter 2. Portal composite pattern and custom designs introduction . 17 2.1 Introduction to the Portal composite pattern . . . . . . . . . . . . . . . . . . . . . . . 18 2.1.1 Business drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.1.2 Jump-start portal questions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.1.3 IT drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 Understanding the Patterns for e-business . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Portal custom designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 Access Integration pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 Self-Service business pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
26 2.3.3 Collaboration business pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.4 Information Aggregation business pattern . . . . . . . . . . . . . . . . . . . . 27 2.3.5 Extended Enterprise business pattern . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.6 Application Integration pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.7 Portal characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.8 The Portal composite pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3.9 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.10 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 © Copyright IBM Corp. 2004. All rights reserved. iii 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Part 2. Portal Search custom design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Chapter 3. The Portal Search custom design . . . . . . . . . . . . . . . . . . . . . . . 35 3.1 What is a Custom design? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 The need for portal search capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3 Technology drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.4 The Custom design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Chapter 4. Application patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1 An overview of the Application patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Application Integration patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.1 Population: Single Step, Multi-step, and Data Cleansing . . . . . . . . . 46 4.2.2 Population: Index Population application pattern . . . . . . . . . . . . . . . 50 4.2.3 Population: Synchronization application pattern . . . . . . . . . . . . . . . . 54 4.2.4 Federation application pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3 Information Aggregation patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3.1 User Information Access application pattern. . . . . . . . . . . . . . . . . . . 57 4.3.2 User Search and Discovery application pattern . . . . . . . . . . . . . . . . 61 4.3.3 Self-Service application patterns compared . . . . . . . . . . . . . . . . . . . 63 4.4 Combining the patterns for search solutions . . . . . . . . . . . . . . . . . . . . . . . 63 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Chapter 5. Runtime patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.1 Runtime node descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2 Runtime pattern for the Portal composite pattern . . . . . . . . . . . . . . . . . . . 72 5.3 Runtime pattern for Portal Search custom design. . . . . . . . . . . . . . . . . . . 73 5.4 Application Integration Runtime patterns . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.4.1 Population: Index Population Runtime pattern . . . . . . . . . . . . . . . . . 
76 5.4.2 Federation Runtime pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.5 Information Aggregation Runtime patterns . . . . . . . . . . . . . . . . . . . . . . . . 85 5.5.1 User Search and Discovery Runtime pattern . . . . . . . . . . . . . . . . . . 86 5.5.2 Information Aggregation in business intelligence solutions. . . . . . . . 90 5.6 Combining the Runtime patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Chapter 6. Portal Search product mappings. . . . . . . . . . . . . . . . . . . . . . . . 93 6.1 Mapping the Runtime pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.1.1 Functional mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.1.2 Product mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.1.3 Network protocol mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.2 Product descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 iv Patterns: Portal Search Custom Design 6.2.1 Lotus Extended Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.2.2 DB2 Information Integrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.2.3 Lotus Domino . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.2.4 Lotus Discovery Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.2.5 WebSphere Application Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.2.6 WebSphere Portal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.2.7 WebSphere Portal Search Engine (Juru) . . . . . . . . . . . . . . . . . . . . 104 6.3 Choosing the product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Part 3. Solution guidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Chapter 7. Technology considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.1 Query syntax support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.2 Support for a common data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.3 Simple versus advanced index creation . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.4 Honoring the security of data sources. . . . . . . . . . . . . . . . . . . . . . . . . . . 118 7.5 Source discovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.6 Performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.7 Client features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 7.8 Client technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.8.1 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.8.2 Dynamic HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.8.3 JavaScript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 7.8.4 Java applets . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.8.5 Java servlets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 7.8.6 JavaServer Pages (JSPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 7.8.7 JavaBeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.8.8 XML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.8.9 Web Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Chapter 8. Application design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 8.2 WebSphere Portal Services architecture diagram . . . . . . . . . . . . . . . . . 135 8.2.1 Single-Tier versus Multi-Tier design . . . . . . . . . . . . . . . . . . . . . . . . 136 8.3 Portal solution guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 8.3.1 Model-View-Controller design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 8.3.2 Content management guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8.3.3 Single sign-on guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.3.4 Collaboration guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.3.5 Web services guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 8.5 Where to find more information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Contents v Part 4. Technical scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Chapter 9. “Chrisco Books” scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 9.1 Chrisco Books scenario: story line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 9.2 Chrisco Books scenario: requirements . . . . . . . . . . . . . . . . . . . . . . . . . . 155 9.2.1 Functional requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 9.2.2 Non-functional requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 9.2.3 Summary of requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 9.3 Patterns mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 9.3.1 Examining the business requirements . . . . . . . . . . . . . . . . . . . . . . 159 9.3.2 Solution options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 9.3.3 Integrating the solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 9.4 Expanding the scenario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Chapter 10. Technical implementation of the scenario . . . . . . . . . . . . . . 167 10.1 The runtime environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 10.2 The Lotus Domino server . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . 170 10.3 The IBM Content Manager server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 10.4 The Lotus Extended Search server . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 10.4.1 Internet and Intranet data source setup . . . . . . . . . . . . . . . . . . . . 179 10.4.2 Domino application data source setup . . . . . . . . . . . . . . . . . . . . . 189 10.4.3 IBM Content Manager data source setup . . . . . . . . . . . . . . . . . . . 190 10.5 The WebSphere Portal server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 10.6 Putting it all together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 Part 5. Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 Appendix A. Pattern changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Appendix B. Understanding the Lotus Extended Search architecture . 207 Extended Search architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Links and translators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 Brokers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 Configuration database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 Appendix C. Using the WebSphere Portal Search Engine . . . . . . . . . . . 219 How to set up Portal Search in WebSphere Portal Server. . . . . . . . . . . . . . . 220 Creating the Search page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 Building a Juru Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 Setting up permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 Configuring the crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 vi Patterns: Portal Search Custom Design Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 Other resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Referenced Web sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 How to get IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 IBM Redbooks collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 Contents vii viii Patterns: Portal Search Custom Design Notices This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. 
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces. © Copyright IBM Corp. 2004. All rights reserved. ix Trademarks The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: AIX® DB2® DB2 Information Integrator™ DB2 Universal Database™ Domino™ Domino.Doc® EDMSuite™ ^™ IBM® ibm.com® ImagePlus® Informix® iSeries™ Lotus® Lotus Discovery Server™ Lotus Notes® Notes® OS/390® Redbooks™ Redbooks (logo) Sametime® SmartSuite® VisualInfo™ WebSphere® z/OS® ™ The following terms are trademarks of other companies: Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others. x Patterns: Portal Search Custom Design Preface The Patterns for e-business are a group of proven, reusable assets that can speed the process of developing applications. The Portal Search custom design builds off the Portal composite pattern, combining Business and Integration patterns to help implement a portal search solution. This IBM Redbook provides a technical scenario and guidelines for the Portal Search custom design. It also shows how the Portal Search custom design works, and documents the tasks required to build a technical scenario of it. Part 1 provides introductory material around the IBM Patterns for e-business, and the Portal composite pattern on which this custom design is based. Part 2 guides you through the process of choosing the Business and Integration patterns of the custom design and then drills down to the Application and Runtime pattern and Product mapping to deliver the desired functionality. Part 3 provides a set of guidelines for implementing and building a portal search solution, including a discussion of search technology selection criteria as well as application design and development. Part 4 demonstrates how to implement a portal search solution via a technical scenario. This technical scenario uses the WebSphere® Portal Extend offering, combined with Lotus® Extended Search. Finally, the appendix of this redbook provides some additional technical details around some of the products used in this custom design, including: Lotus Extended Search and the WebSphere Portal Search Engine technology. The team that wrote this redbook This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, Cambridge Center. William Tworek is a Project Leader with the International Technical Support Organization, working out of Westford, Massachusetts. He provides management and technical leadership for projects that produce IBM Redbooks™ on various topics involving IBM and Lotus Software technologies. Prior to joining the ITSO, he was an IT Architect in the consulting industry working for Andersen Consulting/Accenture, followed by IBM Software Services for Lotus. His areas of expertise include collaborative technologies and business portals, system integration, and systems infrastructure design. © Copyright IBM Corp. 2004. All rights reserved. 
xi Christopher Desforges is a Consulting IT Architect with IBM Software Services for Lotus, working out of New York. Robert Bell is an Advisory IT Specialist working with IBM Software Services for Lotus, working out of California. Raghu Krishnaswamy is Senior Software Engineer with IBM Global Services India. He holds a Bachelor's Degree in Electronic and Communication Engineering, and has experience in Application and Frameworks Architecture. Thanks to the following people for their contributions to this project: Jonathan Adams, Distinguished Engineer, IBM UK, Software Group Technical Strategy Michele Galic, WebSphere Specialist, International Technical Support Organization, Raleigh NC David Bryant, DB2/Business Intelligence Consultant, IBM® UK Todd Leyba, Architect, Extended Search Technology and Development, IBM SWG Dana Morris, Advisory Software Engineer, Extended Search Technology and Development, IBM SWG Yvonne Lyon, Technical Editor, International Technical Support Organization, San Jose CA Become a published author Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners and/or customers. Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html xii Patterns: Portal Search Custom Design Comments welcome Your comments are important to us! We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways: Use the online Contact us review redbook form found at: ibm.com/redbooks Send your comments in an Internet note to: [email protected] Preface xiii xiv Patterns: Portal Search Custom Design Part 1 Part 1 Introductory material Note: In the first part of this redbook, we introduce you to the IBM Patterns for e-business, and the Portal composite pattern on which this book is based. Those already familiar with the IBM Patterns for e-business, or the Portal composite pattern, may want to skip forward to Part 2, “Portal Search custom design” on page 33. © Copyright IBM Corp. 2004. All rights reserved. 1 2 Patterns: Portal Search Custom Design 1 Chapter 1. Patterns for e-business introduction This redbook is part of the Patterns for e-business series. In this introductory chapter we provide an overview of how IT architects can work effectively with the Patterns for e-business. © Copyright IBM Corp. 2004. All rights reserved. 3 1.1 The IT architect The role of the IT architect is to evaluate business problems and to build solutions to solve them. To do this, the architect begins by gathering input on the problem, an outline of the desired solution, and any special considerations or requirements that need to be factored into that solution. The architect then takes this input and designs the solution. This solution can include one or more computer applications that address the business problems by supplying the necessary business functions. To enable the architect to do this better each time, we need to capture and reuse the experience of these IT architects in such a way that future engagements can be made simpler and faster. 
We do this by taking these experiences and using them to build a repository of assets that provides a source from which architects can reuse this experience to build future solutions, using proven assets. This reuse saves time, money, and effort, and in the process, helps ensure delivery of a solid, properly architected solution. The IBM Patterns for e-business helps facilitate this reuse of assets. Their purpose is to capture and publish e-business artifacts that have been used, tested, and proven. The information captured by them is assumed to fit the majority, or 80/20, situation. The IBM Patterns for e-business are further augmented with guidelines and related links for their better use. The layers of patterns plus their associated links and guidelines allow the architect to start with a problem and a vision for the solution, and then find a pattern that fits that vision. Then, by drilling down using the patterns process, the architect can further define the additional functional pieces that the application will need to succeed. Finally, he can build the application using coding techniques outlined in the associated guidelines. 1.2 The Patterns for e-business layered asset model The Patterns for e-business approach enables architects to implement successful e-business solutions through the re-use of components and solution elements from proven successful experiences. The Patterns approach is based on a set of layered assets that can be exploited by any existing development methodology. These layered assets are structured in a way that each level of detail builds on the last. These assets include: Business patterns that identify the interaction between users, businesses, and data. 4 Patterns: Portal Search Custom Design Integration patterns that tie multiple Business patterns together when a solution cannot be provided based on a single Business pattern. Composite patterns that represent commonly occurring combinations of Business patterns and Integration patterns. Application patterns that provide a conceptual layout describing how the application components and data within a Business pattern or Integration pattern interact. Runtime patterns that define the logical middleware structure supporting an Application pattern. Runtime patterns depict the major middleware nodes, their roles, and the interfaces between these nodes. Product mappings that identify proven and tested software implementations for each Runtime pattern. Best-practice guidelines for design, development, deployment, and management of e-business applications. These assets and their relation to each other are shown in Figure 1-1. Customer requirements Composite patterns Business patterns Integration patterns gy olo od eth yM An Application patterns Runtime patterns Product mappings Best-Practice Guidelines Application Design Systems Management Performance Application Development Technology Choices Figure 1-1 The Patterns for e-business layered asset model Chapter 1. Patterns for e-business introduction 5 Patterns for e-business Web site The Patterns Web site provides an easy way of navigating top down through the layered Patterns’ assets in order to determine the preferred reusable assets for an engagement. For easy reference to Patterns for e-business, refer to the Patterns for e-business Web site at: http://www.ibm.com/developerWorks/patterns/ 1.3 How to use the Patterns for e-business As described in the last section, the Patterns for e-business are a layered structure where each layer builds detail on the last. 
At the highest layer are Business patterns. These describe the entities involved in the e-business solution. Composite patterns appear in the hierarchy shown in Figure 1-1 on page 5 above the Business patterns. However, Composite patterns are made up of a number of individual Business patterns, and at least one Integration pattern. In this section, we discuss how to use the layered structure of Patterns for e-business assets. 1.3.1 Select a Business, Integration, or Composite pattern, or a Custom design When faced with the challenge of designing a solution for a business problem, the first step is to take a high-level view of the goals you are trying to achieve. A proposed business scenario should be described and each element should be matched to an appropriate IBM Pattern for e-business. You may find, for example, that the total solution requires multiple Business and Integration patterns, or that it fits into a Composite pattern or Custom design. For example, suppose an insurance company wants to reduce the amount of time and money spent on call centers that handle customer inquiries. By allowing customers to view their policy information and to request changes online, they will be able to cut back significantly on the resources spent handling this by phone. The objective is to allow policyholders to view their policy information stored in legacy databases. The Self-Service business pattern fits this scenario perfectly. It is meant to be used in situations where users need direct access to business applications and data. Let’s take a look at the available Business patterns. 6 Patterns: Portal Search Custom Design Business patterns A Business pattern describes the relationship between the users, the business organizations or applications, and the data to be accessed. There are four primary Business patterns, explained in Figure 1-2. Business Patterns Description Self-Service (User-to-Business) Applications where users interact with a business via the Internet or intranet Simple Web site applications Information Aggregation (User-to-Data) Applications where users can extract useful information from large volumes of data, text, images, etc. Business intelligence, knowledge management, Web crawlers Applications where the Internet supports collaborative work between users E-mail, community, chat, video conferencing, etc. Applications that link two or more business processes across separate enterprises EDI, supply chain management, etc. Collaboration (User-to-User) Extended Enterprise (Business-to-Business) Examples Figure 1-2 The four primary Business patterns It would be very convenient if all problems fit nicely into these four slots, but reality says that things will often be more complicated. The patterns assume that most problems, when broken down into their most basic components, will fit more than one of these patterns. When a problem requires multiple Business patterns, the Patterns for e-business provide additional patterns in the form of Integration patterns. Integration patterns Integration patterns allow us to tie together multiple Business patterns to solve a business problem. The Integration patterns are outlined in Figure 1-3. Chapter 1. 
Patterns for e-business introduction 7 Integration Patterns Description Examples Access Integration Integration of a number of services through a common entry point Portals Application Integration Integration of multiple applications and data sources without the user directly invoking them Message brokers, workflow managers Figure 1-3 Integration patterns These Business and Integration patterns can be combined to implement installation-specific business solutions. We call this a Custom design. Custom design Self-Service Collaboration Information Aggregation Extended Enterprise Application Integration Access Integration We can represent the use of a Custom design to address a business problem through an iconic representation as shown in Figure 1-4. Figure 1-4 Patterns representing a Custom design If any of the Business or Integration patterns are not used in a Custom design, we can show that with the blocks being lighter than the other ones. For example, Figure 1-5 shows a Custom design that does not have a Collaboration business pattern or an Extended Enterprise business pattern for a business problem. 8 Patterns: Portal Search Custom Design Collaboration Information Aggregation Extended Enterprise Application Integration Access Integration Self-Service Figure 1-5 Custom design with Self-Service, Information Aggregation, Access Integration, and Application Integration A Custom design may also be a Composite pattern if it recurs many times across domains with similar business problems. For example, the iconic view of a Custom design in Figure 1-5 can also describe a Sell-Side Hub composite pattern. Composite patterns Several common uses of Business and Integration patterns have been identified and formalized into Composite patterns. The identified Composite patterns are shown in Figure 1-6. Chapter 1. Patterns for e-business introduction 9 Composite Patterns Electronic Commerce Description User-to-Online-Buying Examples www.macys.com www.amazon.com Enterprise Intranet portal providing self-service functions such as payroll, benefits, and travel expenses. Collaboration providers who provide services such as e-mail or instant messaging. Portal Typically designed to aggregate multiple information sources and applications to provide uniform, seamless, and personalized access for its users. Account Access Provide customers with around-the-clock account access to their account information. Online brokerage trading apps. Telephone company account manager functions. Bank, credit card and insurance company online apps. Trading Exchange Allows buyers and sellers to trade goods and services on a public site. Buyer's side - interaction between buyer's procurement system and commerce functions of e-Marketplace. Seller's side - interaction between the procurement functions of the e-Marketplace and its suppliers. Sell-Side Hub (Supplier) The seller owns the e-Marketplace and uses it as a vehicle to sell goods and services on the Web. Buy-Side Hub (Purchaser) The buyer of the goods owns the e-Marketplace and uses it as a vehicle to leverage the buying or procurement budget in soliciting the best deals for goods and services from prospective sellers across the Web. www.carmax.com (car purchase) www.wre.org (WorldWide Retail Exchange) Figure 1-6 Composite patterns The makeup of these patterns is variable in that there will be basic patterns present for each type, but the Composite can easily be extended to meet additional criteria. 
For more information on Composite patterns, refer to Patterns for e-business: A Strategy for Reuse by Jonathan Adams, Srinivas Koushik, Guru Vasudeva, and George Galambos. 10 Patterns: Portal Search Custom Design 1.3.2 Select Application patterns Once the Business pattern is identified, the next step is to define the high-level logical components that make up the solution and how these components interact. This is known as the Application pattern. A Business pattern will usually have multiple possible Application patterns. An Application pattern may have logical components that describe a presentation tier for interacting with users, an application tier, and a back-end application tier. Application patterns break the application down into the most basic conceptual components, identifying the goal of the application. In our example, the application falls into the Self-Service business pattern and the goal is to build a simple application that allows users to access back-end information. The Application pattern shown in Figure 1-7 fulfills this requirement. Presentation synchronous Read/Write data Web Application synch/ asynch Application node containing new or modified components Back-End Application 2 Back-End Application 1 Application node containing existing components with no need for modification or which cannot be changed Figure 1-7 Self -Service::Directly Integrated Single Channel The Application pattern shown consists of a presentation tier that handles the request/response to the user. The application tier represents the component that handles access to the back-end applications and data. The multiple application boxes on the right represent the back-end applications that contain the business data. The type of communication is specified as synchronous (one request/one response, then next request/response) or asynchronous (multiple requests and responses intermixed). Suppose that the situation is a little more complicated than that. Let's say that the automobile policies and the homeowner policies are kept in two separate and dissimilar databases. The user request would actually need data from multiple, disparate back-end systems. In this case there is a need to break the request Chapter 1. Patterns for e-business introduction 11 down into multiple requests (decompose the request) to be sent to the two different back-end databases, then to gather the information sent back from the requests, and then put this information into the form of a response (recompose). In this case the Application pattern shown in Figure 1-8 would be more appropriate. Back-End Application 2 Presentation Application node containing new or modified components synchronous Decomp/ Recomp Transient data - Work in progress - Cached committed data - Staged data (data replication flow) synch/ asynch Back-End Application 1 Application node containing existing components with no need for modification or which cannot be changed Read/ Write data Figure 1-8 Self-Service::Decomposition This Application pattern extends the idea of the application tier that accesses the back-end data by adding decomposition and recomposition capabilities. 1.3.3 Review Runtime patterns The Application pattern can be further refined with more explicit functions to be performed. Each function is associated with a runtime node. In reality these functions, or nodes, can exist on separate physical machines or may co-exist on the same machine. In the Runtime pattern this is not relevant. 
The focus is on the logical nodes required and their placement in the overall network structure. As an example, let's assume that our customer has determined that his solution fits into the Self-Service business pattern and that the Directly Integrated Single Channel pattern is the most descriptive of the situation. The next step is to determine the Runtime pattern that is most appropriate for his situation. He knows that he will have users on the Internet accessing his business data and he will therefore require a measure of security. Security can be implemented at various layers of the application, but the first line of defense is almost always one or more firewalls that define who and what can cross the physical network boundaries into his company network. 12 Patterns: Portal Search Custom Design He also needs to determine the functional nodes required to implement the application and security measures. The Runtime pattern shown in Figure 1-9 is one of his options. Demilitarized Zone (DMZ) Outside World Public Key Infrastructure Web Application Server Domain Firewall User Directory and Security Services Protocol Firewall Domain Name Server I N T E R N E T Internal Network Existing Existing Applications Applications andData Data and Directly Integrated Single Channel application Presentation Application Application Application Figure 1-9 Directly Integrated Single Channel application pattern::Runtime pattern By overlaying the Application pattern on the Runtime pattern, you can see the roles that each functional node will fulfill in the application. The presentation and application tiers will be implemented with a Web application server, which combines the functions of an HTTP server and an application server. It handles both static and dynamic Web pages. Application security is handled by the Web application server through the use of a common central directory and security services node. Chapter 1. Patterns for e-business introduction 13 A characteristic that makes this Runtime pattern different from others is the placement of the Web application server between the two firewalls. The Runtime pattern shown in Figure 1-10 is a variation on this. It splits the Web application server into two functional nodes by separating the HTTP server function from the application server. The HTTP server (Web server redirector) will serve static Web pages and redirect other requests to the application server. It moves the application server function behind the second firewall, adding further security. Demilitarized Zone (DMZ) Outside World Internal Network Public Key Infrastructure Web Server Redirector Domain Firewall User Protocol Firewall Domain Name Server I N T E R N E T Directory and Security Services Application Server Existing Existing Applications Applications andData Data and Directly Integrated Single Channel application Presentation Application Application Application Figure 1-10 Directly Integrated Single Channel application pattern::Runtime pattern: Variation 1 These are just two examples of the possible Runtime patterns available. Each Application pattern will have one or more Runtime patterns defined. These can be modified to suit the customer’s needs. For example, he/she may want to add a load-balancing function and multiple application servers. 14 Patterns: Portal Search Custom Design 1.3.4 Review Product mappings The last step in defining the network structure for the application is to correlate real products with one or more runtime nodes. 
The Patterns Web site shows each Runtime pattern with products that have been tested in that capacity. The Product mappings are oriented toward a particular platform, though more likely the customer will have a variety of platforms involved in the network. In this case, it is simply a matter of mix and match. For example, the runtime variation in Figure 1-10 on page 14 could be implemented using the product set depicted in Figure 1-11. Internal network Demilitarized zone Web Server Redirector Domain Firewall Protocol Firewall Outside world Windows 2000 + SP3 IBM WebSphere Application Server V5.0 HTTP Plug-in IBM HTTP Server 1.3.26 Directory and Security Services Windows 2000 + SP3 IBM SecureWay Directory V3.2.1 IBM HTTP Server 1.3.19.1 IBM GSKit 5.0.3 IBM DB2 UDB EE V7.2 + FP5 LDAP Existing Applications and Data Application Server Windows 2000 + SP3 IBM WebSphere Application Server V5.0 JMS Option add: IBM WebSphere MQ 5.3 Database Web Services Option: Windows 2000 + SP3 IBM WebSphere Application Server V5.0 IBM HTTP Server 1.3.26 IBM DB2 UDB ESE 8.1 Web service EJB application JCA Option: z/OS Release 1.3 IBM CICS Transaction Gateway V5.0 IBM CICS Transaction Server V2.2 CICS C-application JMS Option: Windows 2000 + SP3 IBM WebSphere Application Server V5.0 IBM WebSphere MQ 5.3 Message-driven bean application Windows 2000 + SP3 IBM DB2 UDB ESE V8.1 Figure 1-11 Directly Integrated Single Channel application pattern: Windows® 2000 product mapping Chapter 1. Patterns for e-business introduction 15 1.3.5 Review guidelines and related links The Application patterns, Runtime patterns, and Product mappings are intended to guide you in defining the application requirements and the network layout. The actual application development has not been addressed yet. The Patterns Web site provides guidelines for each Application pattern, including techniques for developing, implementing, and managing the application based on the following guidelines: Design guidelines instruct you on tips and techniques for designing the applications. Development guidelines take you through the process of building the application, from the requirements phase all the way through the testing and rollout phases. System management guidelines address the day-to-day operational concerns, including security, backup and recovery, application management, etc. Performance guidelines give information on how to improve the application and system performance. 1.4 Summary The IBM Patterns for e-business are a collective set of proven architectures. This repository of assets can be used by companies to facilitate the development of Web-based applications. They help an organization understand and analyze complex business problems and break them down into smaller, more manageable functions that can then be implemented. 16 Patterns: Portal Search Custom Design 2 Chapter 2. Portal composite pattern and custom designs introduction Organizations strive to achieve the best combination of business process efficiency, deep customer knowledge and mindshare, and product leadership — a combination that best suits their business goals. In order to obtain these goals, organizations leverage role-based portals to provide relevant information to specific audiences. The Portal composite pattern assists in the design process for a portal implementation. Portal custom designs will typically be variations of the Portal composite pattern, which has been extended to meet a specific customer’s requirements. © Copyright IBM Corp. 2004. All rights reserved. 
17 2.1 Introduction to the Portal composite pattern A Portal composite pattern leverages various mechanisms (for example, personalization, collaboration, content management, user interface formatting and display, and data aggregation) to bring together the appropriate information and existing systems to serve the goals of the business. For example, when attempting to grow customer mindshare and knowledge, a portal system can bring together the proper information tailored to the type of user that the business would like to target. This can be implemented in a number of ways; however, the cogent point to remember is that when a customer is made to feel that business truly understands his or her needs and wants, they will be retained as a customer. Consequently, the customer’s needs and wants may be provided through the business achieving product leadership, great customer service, or highly efficient transactional processes that support product leadership and/or customer service. The components of personalization, collaboration, multi-device type access, a presentation rendering mechanism, and a business rules engine, are combined with the ability to search and index content (of various types and formats) and management of content via a workflow process to provide both content aggregation and a collaborative environment. A portal can help a business gain marketshare, retain existing customers, and reduce costs through the ability to target the delivery of information to specific user audiences. 2.1.1 Business drivers Business drivers are specific goals that the business is trying to achieve. In most cases, business drivers have an ultimate goal of reducing costs, increasing revenue, or improving productivity. In fact, a business can be any type of organization (for example, manufacturing, research, military, etc.) that seeks to make the best use of its available resources and determine if new resources are required. The design of a portal can help to clarify these goals, and analysis of interactions with the portal can further define and enhance these drivers. Various “paths” that can be followed to achieve the desired results are as follows: Deep customer knowledge and mindshare: This can be thought of as “customer intimacy”. When a business wants to provide the “best customer service” experience and this is their primary driver for revenue, they need to understand their customers and market as much as possible. So it is important to identify these types of customers when designing a portal, and once implemented, a portal can provide valuable knowledge about the habits of the targeted audience. 18 Patterns: Portal Search Custom Design Also, this information can be used to determine if the targeted audience is helping the business achieve its goals. Thus an organization can increase customer retention through deep knowledge of that customer, resulting in increased revenues through more efficient marketing practices. Product leadership: Some organizations want to be the best in their market for the products or services they provide. These organizations want to achieve leadership from a quality and/or marketplace mindshare perspective. One of the common methods for providing product mindshare leadership is by communicating certain information about upcoming products or enhancements to existing products to the targeted audience. In addition, if the business can identify other possible audiences to expand their customer base, this can also contribute to product leadership. 
A portal can assist in disseminating both the technical and marketing information about the products or services provided, and this information can be tailored to specific user audiences (as defined by demographic and “device type” information). In addition, the usage of the portal by these targeted audiences (customers) can be analyzed to determine if the marketing efforts are successful. Business process efficiency: Organizations that have identified increased efficiency in their internal processes want to attain the highest possible efficiency in the transactions that take place between departments, divisions, employees, and external partners (for example, external suppliers who supply raw material for the products or services being offered). A portal can give people access to the information and business processes that they need in a single, secure, and dynamic environment. This allows the user to: – Access the relevant information in context of the task performed – Collaborate in context of the business process – See consolidated information in a single aggregated view This collaborative portal environment, with its aggregated views of data, provides just the information necessary for the person or entity to gain maximum efficiency in how tasks are accomplished. For example, in sales it is important to present the sales person all the information available on a given customer in context with the products the customer buys and the most recent sales activities with that customer. Often there is also the need for additional information from other sales people or the project team working on site with the customer. Using collaboration tools can help to have better and more responsive interaction within the teams and ulimatively lead to faster decision making. Chapter 2. Portal composite pattern and custom designs introduction 19 We see that a portal implementation requires the identification of the information desired, the audience for that information, and an analysis of the usefulness of that information to fulfill the business drivers of an organization. Organizations may have only one of these business drivers, or there may be a combination of these drivers that will help the organization meet their goals. In addition, concepts such as ease of use (for example, single sign-on), security, and reduced Total Cost of Ownership (TCO) are all examples of specific, tactical goals of a portal implementation that will ultimately support the three core business drivers above. The following are some additional examples of specific goals that can be used to achieve the ultimate drivers for an organization: Time to market Improved organizational efficiency Faster decision making Reduced latency of business events Adaptability during mergers and acquisitions Integration across multiple delivery channels Unified customer view across lines of business Support of effective cross selling Support of mass customization (reducing the cost of customizing products and services) 2.1.2 Jump-start portal questions The Patterns for e-business and specifically the Portal composite pattern assist in the design process for a portal implementation. This allows both the business and technical groups within an organization to ask the jump-start portal questions. They are as follows: 1. Where is the information in our organization located? In order to aggregate information, the location of this information (applications, databases, external sources) must first be determined. 2. 
Does the information needed currently exist? The business drivers will determine what information is needed. 3. Do you want to enable collaboration and human interaction across all areas of business? Processes and applications don't make decisions — the users do. A collaborative portal environment allows you to integrate human interaction with processes and information. These are capabilities that, for example, let people get the just-in-time advice, education, consensus, and approval they need to respond quickly, to any business situation or emergency. 20 Patterns: Portal Search Custom Design 4. Do you want to enable widespread teams to work together efficiently in the context of the business process? The portal allows you to make your organization’s people, processes, and information readily available to individual teams so they can solve everyday business problems more efficiently. 5. What are the processes by which that information is collected, updated, managed, and disseminated? Portals are based on information that has to be managed from a collection, update, and processing perspective. There are likely existing processes by which information (that already exists in the organization) is collected, and these will have to be examined to determine if they must be modified to support both the business and IT drivers. 6. What defines a portal user? The definition of a user will impact the types of security, the types of data, and the types of client devices that need to be supported. 7. What is to be gained by implementing a portal? This is a direct reference to the business drivers. Once an organization defines what they want to achieve or improve upon, then a discrete set of business goals, or drivers, can be identified. 2.1.3 IT drivers As with all organizations, those concepts that drive the IT organization to make decisions are ultimately driven by the needs of the organization at the business or enterprise level. Those items, described in 2.1.1, “Business drivers” on page 18, can each be supported through the appropriate use of technologies that help implement the following goals: Minimize application complexity Minimize Total Cost of Ownership (TCO) Are open-standards-based Offer an end-to-end solution Leverage existing skills Leverage legacy investment Integrate back-end applications Minimize enterprise complexity Support maintainability Support scalability Support availability Chapter 2. Portal composite pattern and custom designs introduction 21 Many of these IT drivers are focused on cost reduction through minimizing complexity. These can be further abstracted into five core IT drivers as follows: Availability: The IT organization needs to have the solution available as defined in the business drivers. A portal implementation means having the information that the customer wants to see in the way he wants to see it. Therefore, the application needs to be available when the customer wants to see it. Open-standards-based: An open-standards-based infrastructure provides both a choice of platform and the ability to integrate into other vendor’s environments. In addition, it allows the applications that you develop to interact in a much larger environment. Reusability: Reusing existing IT assets, such as programming code, existing applications, and existing data sources, can reduce overall cost. A portal implementation, and specifically the Portal composite pattern, brings together various existing and new systems to construct an end-to-end solution. 
Maintainability: Maintainability is a goal of the IT organization because shifting business goals will often require adding or deleting functionality. In addition, the sources of information available to a portal system may change. Thus, it is vital that the portal implementation be able to adapt to a changing environment by isolating the different systems, so that changes to one type of component do not affect the other components that make up the portal system.

Scalability: The Portal composite pattern is a "best mix" of nodes and components that leads to the Portal composite Runtime pattern discussed in 5.2, "Runtime pattern for the Portal composite pattern" on page 72. This Runtime pattern is a high-level representation of a portal architecture that separates the components so that each component can be chosen for maximum scalability. Scalability is also important because the system should be designed and built only once and should be able to handle increased demands. This supports the general business driver of reduced cost and operational efficiency.

Extensibility: Extensibility in a system design allows for easier functional enhancement as the needs of the business change or increase. Once again, this IT driver supports the general business driver of reduced cost by making it possible to reuse the same architected solution.

2.2 Understanding the Patterns for e-business

Understanding the Patterns for e-business, and specifically the Composite patterns, is not always a straightforward process. In interpreting how to use the Patterns for e-business, it is best to start with how people in different roles might leverage them to explain or justify a particular solution. In a portal implementation, the Portal composite pattern is the logical starting point, because it identifies the Business and Integration patterns that make sense for a typical portal implementation. The following roles are common in the IT industry, and each will use and leverage the patterns in a different manner.

Sales: The Sales role describes a person who establishes the initial relationship with the organizations that might benefit from using and understanding the patterns. The role can be a person within an organization or a person from an external vendor (for example, IBM Global Services) who has the expertise to understand the business problems and issues that need to be addressed. The sales person will use the patterns to begin the analysis discussion with the business-level stakeholders to understand the business drivers. They will start with the Business and Integration patterns and likely continue to the Application and Runtime patterns, showing how you can move from Business patterns to Application patterns to Runtime patterns. During discussions with the business stakeholders, the initial team identifies the high-level goals of the business and may determine that no single Business or Integration pattern will address all the business drivers. At this point a Composite pattern makes sense and, specifically when there are information aggregation and other requirements that are fulfilled by a portal, the Portal composite pattern is a good starting point. Refer to Applying Pattern Approaches, SG24-6805, for more details on how to use the Patterns for e-business in a sales role.

Project Manager: A project manager needs an understanding of the patterns that have already been chosen so that a set of tasks can be derived.
Priorities can then be set, because once the Business and Integration patterns have been chosen, the business and IT drivers are understood.

Architect: The architect is the bridge between the business and technology domains. Once the business drivers are understood (from the discussions with the Sales role and the business stakeholders), the architect can decide on the likely IT drivers, namely those goals IT must focus on to fulfill the business drivers. Combined discussions with both the business and IT stakeholders are important so that all can participate in the process of determining the final set of Business and Integration patterns, then decide on the Application patterns, and finally decide on the set of Runtime patterns. Once this is complete, the architect can derive an initial architecture (operational architecture and general architecture overview) for the portal solution to be implemented. When design begins, it is the role of the architect to understand the "big picture" of the system and to make sure that the components supporting the most important business drivers are given top priority. A Composite pattern such as the Portal composite pattern saves the architect time by performing some initial "integration" work, bringing together various characteristics that are important to a typical portal implementation. Anything that speeds up the process, such as the Patterns for e-business assets, saves time and increases the chances of a successful implementation.

Developer: Although developers are generally tasked with very specific programming-level tasks, it is important for those in this role to understand the general thinking behind how the architecture was originally designed. This allows the team to leverage the focused technical knowledge of a developer (an expert Java programmer, for example) to understand how their tasks fit into the system and to alert the team to how their work might impact other components being designed. This role works to augment the architect role. The developer also uses the Application design and development guidelines provided by the Patterns for e-business to assist and speed up the application development cycle.

The patterns are used to bring together the business and technical people in an organization. The intersection point of these two groups is the set of Runtime patterns, which are detailed enough for developers and abstract enough for business people (because these Runtime patterns are far less complex than a portal or systems architecture diagram). The Portal composite pattern is valuable because it captures some of the initial "integration thinking" that leads to a typical portal implementation. Of course, the standard caution that "your mileage may vary" applies to the use of this Composite pattern, because each implementation will introduce some variation. Using just one or two Business and/or Integration patterns may not address all of the business and IT drivers. The creation of the Portal composite pattern has brought together a combination of patterns that can jump-start the design and analysis process. You realize savings in time, people, and thus money by leveraging reusable assets such as the Patterns for e-business.
2.3 Portal custom designs

A Custom design, like the Composite patterns, combines Business and Integration patterns to create advanced, end-to-end e-business applications. These solutions, however, have not been implemented to the extent of the Composite patterns. The Business and Integration patterns that could be combined in any given Portal custom design are as follows:
- Access Integration
- Self-Service
- Collaboration
- Information Aggregation
- Extended Enterprise
- Application Integration

Depending on the type of portal solution being deployed, different combinations are implemented based on the required functionality. Some of these Business and Integration patterns are more common than others. Our premise here is that the patterns common to any Portal custom design contribute to the Portal composite pattern that is the focus of this endeavor. One of the patterns, Access Integration, can be considered the most distinctive pattern for a portal, given its focus on improving a user's access to information and e-business services. Since the Integration patterns are used to extend the capabilities of Business patterns, we will also be looking at which of the other patterns contribute to a specific portal scenario. We will see that Self-Service, Collaboration, Information Aggregation, and Application Integration can also be important to portal solutions.

2.3.1 Access Integration pattern

The Access Integration pattern is commonly observed in e-business solutions that provide users with a seamless and consistent user experience combining access to multiple applications, databases, and services. It is used as a front-end integration pattern. The Access Integration pattern does not stand alone in a solution; it is typically used to combine Business patterns into Custom designs and Composite patterns that solve complex business problems. Access Integration contains many of the characteristics that describe a portal implementation. It fits well into the Portal composite pattern because it includes the aggregation and management of information, access to information by various user and group types, and clearly defined business "rules" that determine which user types can access which types of data. For more information on the Access Integration pattern and its services, refer to Access Integration Pattern Using WebSphere Portal Server, SG24-6267.

2.3.2 Self-Service business pattern

The Self-Service business pattern describes situations where users interact with a business application to view or update data. Often an organization not only wants to disseminate information internally, but also wants to make this information available to external users and partners. The Self-Service business pattern is focused on allowing the end user to access information from various data sources through a mechanism that provides just the specific information that applies.
For more information on the Self-Service business pattern, refer to the following redbooks:
- Patterns: Self-Service Application Solutions Using WebSphere V5, SG24-6591
- Self-Service Applications Using IBM WebSphere V5.0 and WebSphere MQ Integrator V2.1, Patterns for e-business Series, SG24-6875

2.3.3 Collaboration business pattern

The Collaboration business pattern enables interaction and collaboration between users, including e-mail, virtual team meetings, e-learning, instant messaging, and workflow processes. This pattern can be observed in solutions that support small or extended teams who need to work together to achieve a joint goal. Collaboration is often combined with a workflow engine that provides the ability to set up and support more complex processes involving multiple users from different workgroups, departments, and organizations. An emerging capability is the concept of contextual collaboration, which incorporates functions previously found only in knowledge management applications. This includes the ability to apply context to a piece of content, to discover the experts within the organization, and to add collaborative functions to transaction-based applications. Collaboration is a core feature of a portal implementation.

2.3.4 Information Aggregation business pattern

The Information Aggregation business pattern describes situations where users access and manipulate large amounts of data collected from multiple sources. There are two broad aspects to consider: 1) populating data stores with aggregated data, and 2) access to the aggregated data. Population is accomplished through Application Integration techniques (see the note below). How access is supported is related to the scope of data being accessed. Access to a small portion of aggregated data is easily handled by the Self-Service business pattern; access to a single individual's account summary that was aggregated from multiple systems is an example of this. On the other hand, when access is characterized by analysis and manipulation of large amounts of the aggregated data, then Information Aggregation applies. This type of access typically uses sophisticated tools that analyze, summarize, and report on large quantities of aggregated data stored in specially designed databases optimized for just this type of analysis and reporting.

Note: This reflects a recent change to the alignment of the population-related Application patterns. Previously, these patterns were grouped together with the information access patterns and associated with Information Aggregation. The population patterns are now considered part of the data-focused Application Integration patterns. Information Aggregation now addresses information access, in particular when that access is characterized by analysis and manipulation of large amounts of aggregated information, an approach typically associated with business intelligence.

2.3.5 Extended Enterprise business pattern

The Extended Enterprise business pattern describes the programmatic interaction between two distinct businesses. The focus of the Portal composite pattern is to implement a portal within a business or single enterprise; it does not directly address how two separate enterprises interact. In this book, our analysis led us to treat external enterprises simply as additional "data sources," which seems clearer than talking about enterprise-to-enterprise interaction; however, this is open to interpretation. Portals are about integrating data and processes, so this pattern only makes sense when the data sources and systems of two enterprises are being brought together, which implies a more complex re-architecture of the two systems. It is just as effective, and less complex, to treat external systems simply as data sources, the same as local databases or applications. If these external systems support common communication methods, the integration becomes that much easier.
2.3.6 Application Integration pattern

The Application Integration pattern provides for the seamless back-end integration of multiple applications and/or data. Application Integration can be process focused as well as data focused, and can therefore address integration requirements for any of the Business patterns. A portal can act as an integration mechanism for both application services and information. Self-Service can use a process-focused approach to transparently invoke enterprise application services, and can depend on data-focused application integration to populate a centralized operational data store containing customer information. Information Aggregation may depend on data-focused application integration to populate the data stores that will be used for analysis and reporting. This means that either or both forms of application integration may be utilized by the portal, depending on the specific scenario.

2.3.7 Portal characteristics

The diagram shown in Figure 2-1 was part of the process used to identify the Business, Integration, Application, and Runtime patterns that could be combined into a Portal composite pattern based on the characteristics we needed. You can use this diagram as a starting point to help determine the best fit for the particular solution you need to create.

Figure 2-1 Patterns hierarchy contributing to the Portal composite pattern

2.3.8 The Portal composite pattern

The Business and Integration patterns that we have identified as the building blocks, or the more common patterns, of the Portal composite pattern are as follows:
- Access Integration pattern
- Self-Service business pattern
- Collaboration business pattern
- Application Integration pattern

Please note that, based on your specific requirements, the Business and Integration pattern building blocks for your portal may vary from the Portal composite pattern. For example, you may find that you have a use for the Extended Enterprise business pattern in addition to the ones we defined, or you may find that you only need the Access Integration, Collaboration, and Information Aggregation business patterns for your portal. Based on your specific requirements, this would then be defined as a Portal custom design. For this redbook, the visual representation of the Portal composite pattern is shown in Figure 2-2.
Figure 2-2 Portal composite pattern showing our mandatory patterns

Note: This composite pattern reflects the re-alignment of the data population patterns from Information Aggregation to Application Integration.

2.3.9 Benefits

The Portal composite pattern is a combination of patterns, technologies, and products. It allows for an understanding of the business and IT drivers that help an organization answer these questions: Do I need a portal? What can I achieve with a portal? Once an organization has determined that it needs to aggregate information, target that information to specific users, analyze the usage of information, and collect and manage information, it can use a portal to handle these requirements. Consequently, using the Portal composite pattern will eventually lead to a choice of Application patterns and the subsequent combined Runtime pattern. This, in turn, will drive the creation of a portal architecture. Some specific benefits include:
- A single aggregated view of content targeted to specific user types
- The ability to analyze usage patterns to make marketing efforts more efficient
- The ability to tailor the user interface to specific groups, enabling a focus on cultural, language, and nationality-based differences
- Single sign-on, allowing users to save time and have access to information while lessening the requirements for direct interaction with the organization (which saves money)
- Collaboration and human interaction in the context of the business process
- The ability for widespread teams to work together efficiently in the context of the business process

2.3.10 Limitations

The creation of a portal can in some cases be a complex undertaking. The degree of complexity is driven in large part by the scope or range of application services and aggregated content that will be provided through the portal. As the number of applications being integrated increases, or the complexity of the content or aggregated information expands, a portal implementation will likewise increase in complexity. This translates into an impact on the IT organization within the enterprise. Introduction of a portal can also have an impact on the business organization within the enterprise. Although this should be an intended rather than unexpected result of a portal implementation, the following are examples of what should be considered and planned for:
- Organizational changes
- Process changes
- Restructuring of existing data sources
- Rebuilding some existing applications to support the available connectivity options
- Detailed analysis of the various user groups that need to be supported (usually in much more detail than what currently exists)

The Portal composite pattern assumes that there will be impacts in all of these areas.

2.4 Summary

In summary, the Portal composite pattern includes characteristics from several Business and Integration patterns that are typically part of a portal implementation. However, when designing your solution, re-evaluate the chosen patterns to ensure that they contain the characteristics that are important for the portal solution you are creating. Remember that the choice of a pattern, and the subsequent architecture, is ultimately based on the business drivers and must support those drivers.

Part 2. Portal Search custom design

Chapter 3. The Portal Search custom design
So far in this redbook, we have introduced the key concepts behind the IBM Patterns for e-business, and we have introduced the Portal composite pattern and Portal custom designs in general. In this chapter, we introduce a specific variation of the Portal composite pattern in the form of a Custom design: the Portal Search custom design. The goal of this custom design is to provide a solution for the advanced search requirements that are being identified as organizations deploy portal solutions into their environments and integrate them into their business processes.

3.1 What is a Custom design?

As introduced in Chapter 1, "Patterns for e-business introduction" and Chapter 2, "Portal composite pattern and custom designs introduction" on page 17, Custom designs are similar to Composite patterns in that they combine Business patterns and Integration patterns to form an advanced, end-to-end solution. These solutions, however, have not been implemented to the extent of Composite patterns; instead they are developed to solve the e-business problems of one specific company, or perhaps several enterprises with similar problems. In general, Custom designs do not meet the higher qualifications of a Composite pattern, and do not give as great a reassurance of reusability, because they have not been "recurrently employed to solve the problems of businesses across a wide range of industries." However, as the Custom designs detailed on the Patterns for e-business Web site and within this redbook are used more and more by diverse developers, who are vocal about the benefits and limitations of these solutions, these Custom designs might eventually achieve the status of Composite patterns.

3.2 The need for portal search capabilities

As businesses have begun deploying and integrating portal solutions, the need for more extensive search and retrieval capabilities, beyond those provided in the base Portal composite pattern, has begun to surface. Overall, the Portal composite pattern provides data access via simple self-service capabilities that allow one to receive small views of data. The inclusion of these self-service capabilities as part of an overall secure and personalized portal interface means that these "snippets" of data are organized in a manner that makes them easy to locate and find. However, most enterprises are still left with no clear way to access and search across all of their corporate data and knowledge, especially the unstructured data that is not normally accessible via business intelligence tools. While each key data source may be surfaced in a self-service manner within the portal as a portlet application, each still requires training on its syntax, semantics, and interfaces. Obviously, in any portal implementation, the ability to locate such data and information is vital, and thus the next logical step in providing information location capabilities in a portal is to provide robust search and retrieval capabilities that can encompass these disparate data sources. This includes expanding search capabilities beyond the context of the portal, so that a single request can search potentially thousands of data repositories, the Internet, and for people with expert knowledge, all at the same time.
These repositories could be of varied content and structure and, like the experts with whom you need to collaborate, they might be geographically dispersed throughout the world. It is the combination of these advanced search needs with the other personalization, self-service, and collaborative capabilities of a Portal composite pattern that takes the value of a portal to the next level. Organizations are able to perform all key business activities from within the portal, resulting in improved efficiency and reduced latency of business events. Thus, in defining the need for such advanced search capabilities, we have ultimately defined the business drivers for such a Portal Search custom design, as follows: to help streamline current business activities by improving organizational efficiency and reducing the latency of business events. When this desire is applied to search capabilities, one can see that these efficiency improvements will ultimately come from:
a. Distilling meaningful information from a vast amount of structured and unstructured data
b. Providing easier access to vast amounts of unstructured data through indexing, categorization, and other advanced forms of summarization

3.3 Technology drivers

As we described when we introduced the Portal composite pattern earlier in this redbook, while it is the business drivers that ultimately drive an IT organization, the appropriate use of technologies can also have an important business impact in terms of:
- Minimized application complexity
- Minimized total cost of ownership (TCO)
- Leverage of existing skills
- Leverage of legacy investments
- Back-end application integration
- Minimized enterprise complexity
- Maintainability
- Scalability
- Availability

For the advanced portal search needs we have described to this point, there can be substantial savings in terms of reducing costs, leveraging legacy investments, minimizing enterprise complexity, and simplifying maintainability, all depending on the manner in which these search capabilities are built and integrated. For example, when building a single search interface that accesses multiple data repositories and legacy systems on the back end, how does one do this in a cost-effective manner? How does one build interfaces into all of these systems so that they are easily maintained? What happens when a given repository is upgraded or replaced? What are the impact and costs of such a change? By clearly defining IT drivers in regard to cost, simplicity, and maintainability as key decision points in the solution, we ensure that any solution meets the wider business needs beyond the purely functional needs. Thus, these defined IT drivers, along with the business drivers, ultimately feed into the selection of the Application patterns that are appropriate and will be used to implement this Custom design.

3.4 The Custom design

Now that we have defined the business and IT context for this custom design, built on top of the Portal composite pattern, it is time to discuss the specific Business patterns that apply. However, to clearly distinguish this Custom design from the Composite pattern, we must go down to the Application pattern level as well. When considering the Application patterns for any composite pattern or custom design, both mandatory and optional patterns are described.
The mandatory patterns represent the recurring patterns that should be regularly implemented by companies in a Portal Search custom design. The optional patterns are not necessarily implemented with each solution, but may make sense to include for a specific company's requirements; their inclusion would result in yet another Portal custom design. Figure 3-1 depicts the normal mandatory and optional Application patterns for a Portal solution, as they map to the Business and Integration patterns already discussed in Chapter 2, "Portal composite pattern and custom designs introduction." As discussed in that chapter, these identified Application patterns are based on the requirements of typical portal implementations.

Note: As mentioned earlier in this redbook, recent changes have been made to the alignment of the "Population" related Application patterns, as well as to the naming of some of the Information Aggregation patterns. The population patterns are now considered to be part of the data-focused Application Integration patterns. Information Aggregation now addresses information access, in particular when that access is characterized by analysis and manipulation of large amounts of aggregated information, an approach typically associated with business intelligence, content management, and knowledge management. Details on these recent changes in patterns can be found in Appendix A, "Pattern changes" on page 205.

Figure 3-1 Portal composite pattern: typical application patterns

When examining the patterns included in Figure 3-1, and the functionality provided by these patterns, it is clear that some level of "data access" is included via the Application Integration::Population and Self-Service::Directly Integrated Single Channel patterns. Of course, as discussed earlier in this book, how access is supported is related to the scope of the data access, and only access to small snippets of information is supported in the base Portal composite pattern via inclusion of the Self-Service pattern. However, the business and IT drivers we defined earlier for this Portal Search custom design clearly show the need for data access characterized by analysis and manipulation of large amounts of data. In such cases "Information Aggregation" applies, and the Information Aggregation business pattern must be introduced to this design to support this need. The Application pattern specifically related to search that we thus need to add to our Custom design is the Information Aggregation::User Search and Discovery application pattern. Additionally, as the amount and scope of data access increases, the capabilities required for populating data stores with such data also increase, and additional Application Integration patterns are required. To join the existing Population: Single Step and Population: Index Population patterns, we must also add the Population: Multi-step and Federation patterns to cover the more advanced data collection needs. Figure 3-2 shows this updated list of Application patterns that makes up the Portal Search custom design.
Figure 3-2 Portal Search custom design: application patterns

As depicted, the changes between the base Portal composite pattern and the Portal Search custom design are the inclusion of the following additional patterns:
- Information Aggregation::User Search and Discovery
- Application Integration::Population: Multi-step
- Application Integration::Federation

When combining these three new Application patterns with the previously existing Application Integration::Population: Single Step and Population: Index Population patterns, we have a full set of search-related capabilities. It is these five Application Integration and Information Aggregation patterns that will be the focus of the rest of this redbook. Figure 3-3 then depicts a higher-level comparison between the Portal composite pattern and this custom design, in which the fundamental difference is the inclusion of the Information Aggregation business pattern.

Figure 3-3 Portal composite pattern compared to Portal Search custom design

However, it is important to note that the remainder of the mandatory and optional Application patterns from the original Portal composite pattern, which are still included within the custom design, are all crucial to integrating these search capabilities into a portal solution. For example, the Access Integration patterns are required to integrate the search capabilities with the other portal functionality via single sign-on and personalization, while the Collaboration and Self-Service patterns are required to provide the common collaborative and teaming benefits of a robust portal solution. For a detailed discussion on all of the Application patterns associated with the overall Portal composite pattern, please see the following IBM Redbooks:
- Patterns: A Portal composite pattern using WebSphere Portal V4.1.2, SG24-6869
- Patterns: A Portal composite pattern using WebSphere Portal V5, SG24-6087

3.5 Summary

At this point we have introduced the IBM Patterns for e-business, the Portal composite pattern, and now the Portal Search custom design. While this is a large amount of introductory material, we felt it important to show the full "lineage" of this Custom design, so that all readers can clearly understand the business context for this solution. It is now time to get into some of the details of this Custom design and the technologies one can use to implement it, and we will do so in the remaining chapters of this redbook.

Chapter 4. Application patterns
After identifying the Business and Integration patterns, the next step in planning an e-business application is to choose the Application pattern(s) that apply to the business drivers and objectives. An Application pattern shows the principal layout of the application, focusing on the shape of the application, the application logic, and the associated data. Such Application patterns are then taken to the next level by creating a set of Runtime patterns; Runtime patterns are discussed in the next chapter. This chapter focuses on defining and describing the Application patterns that apply to typical e-business portals that have been extended for robust search capabilities. The Application patterns we will discuss fall under the Information Aggregation business pattern and the Application Integration pattern.

4.1 An overview of the Application patterns

As identified in Chapter 3, "The Portal Search custom design", the following five Application patterns differentiate this Custom design from the basic Portal composite pattern, and it is these Application patterns that we discuss in this chapter:
- Information Aggregation::User Search and Discovery
- Application Integration::Population: Single Step
- Application Integration::Population: Multi-step
- Application Integration::Population: Index Population
- Application Integration::Federation

However, prior to examining these specific Application patterns in more detail, it is important to first understand the relationship between the Information Aggregation and Application Integration patterns. Basically, Application patterns use logical tiers to illustrate the various ways to configure the interaction between users, applications, and data. The focus in these tiers is on the application layout, shape, and application logic for the associated data. In some cases, though, multiple Application patterns may be required to define a complete interaction between users, applications, and data. The results of one Application pattern will feed into another Application pattern, so that the combination of patterns results in a functioning e-business solution. Search solutions based on the Application Integration and Information Aggregation patterns follow this model. First, the data integration aspects of the Application Integration (also known as Enterprise Application Integration) patterns serve to integrate the information (or data) used by multiple applications. In the case of search solutions, existing data is available, in both structured and unstructured forms, in existing application data repositories. A proven, repeatable pattern is thus needed for combining this data in one search, and this is where the Application Integration patterns come in. Next, the Information Aggregation patterns allow users to access and manipulate data that is aggregated from multiple sources. Thus, these patterns take the data that is made available from the multiple sources and applications via Application Integration, and provide tools to extract useful information and value from such large volumes of data. Figure 4-1 depicts this relationship between these two Business and Integration patterns.
Figure 4-1 The relationship between Information Aggregation and Application Integration

4.2 Application Integration patterns

As a whole, the Application Integration pattern (also known as Enterprise Application Integration) serves to integrate multiple Business patterns, or to integrate applications and data within an individual Business pattern. The pattern has two approaches for providing such integration:
- Process integration: the integration of the functional flow of processing between the applications
- Data integration: the integration of the information used by applications

For search-related solutions, it is primarily the data integration aspects of Application Integration that are involved, integrating data for use with an individual (Information Aggregation) business pattern. Thus, this section focuses on the Application Integration application patterns that implement such "data integration". These data-integration-focused patterns can be broken into two sub-categories:
- Data movement application patterns:
  – Population: Single Step
    • Population: Multi-step
    • Population: Data Cleansing
  – Population: Index Population
  – Population: Synchronization
- Federated access application patterns:
  – Federation

In general, these patterns apply both to search and to traditional business intelligence and data mining types of activities. However, the pattern descriptions in this redbook will focus more on their usage in the unstructured text/search world than in the more structured data/business intelligence world, although information on all aspects of the patterns will be provided whenever feasible.

4.2.1 Population: Single Step, Multi-step, and Data Cleansing

The Population: Single Step, Multi-step, and Data Cleansing patterns all follow a similar model and build upon each other. The primary business driver for these Population patterns is to reconcile data from multiple data sources. In Single Step population, the reconciliation is sufficiently simple that it can be conceived as a single functional entity. In many cases, however, the transformation and restructuring is rather complex; this leads to the Multi-step variation. Similarly, extensive analysis and cleansing is emphasized in the Data Cleansing variation. These patterns are most often applied to business intelligence related business problems. However, they can be utilized to provide content feeds of more unstructured data into an e-business portal. This "content" can then be accessed via the portal, or even searched via basic portal search capabilities.

Business and IT drivers

Here, we are concerned with the following business and IT drivers:
- Improve organizational efficiency
- Reduce the latency of business events
- Distill meaningful information from a vast amount of structured data
- Minimize total cost of ownership (TCO)
- Promote consistency of operational data
- Maintainability

The primary business driver for choosing the Population: Single Step, Multi-step, or Data Cleansing patterns is to copy data from a source data store to a target data store, with possible transformation of the data in the process. In the case of a single step, the main reason for creating a copy of the data is to avoid manipulating the primary source of a company's operational data, which is often maintained by operational systems.
However, in the case of Multi-step or Data Cleansing, the data requires extensive reconciliation, transformation, and restructuring to improve usability.

Solutions

The Application pattern shown in Figure 4-2 represents the basic single-step data population functionality as a "read dataset – process – write dataset" model. There can be one or more source data stores that are read by the population application; these source data stores are created and maintained by other processes. The target data store is the output from the population application. It can be the final output from the process, or it can be an intermediate data store used as the source for another step in the process.

Figure 4-2 Population: Single Step application pattern

The box around the source data represents the fact that the source data may need to be accessed by means of a controlling application via an application API, or may be accessed directly via a database API. The metadata contains the rules describing which records from the source are read, how they are modified (if needed) on their way to the target, and how they are applied to the target. The rules are depicted in this way to emphasize the best practice of having a rules-driven application, rather than hard-coding the rules in the application, to facilitate maintenance. This logical dataset also holds a variety of metadata describing the output that the population application produces, such as statistics, timing information, and so on. In general, both source and target can contain any type of data, including structured and unstructured data. However, in the majority of cases, this Application pattern is used for propagating structured data from one data store to another. In providing the capabilities outlined above, this Application pattern uses common services related to data-focused integration, such as data replication, cleansing, transformation, and augmentation. These common services are further elaborated on the Patterns for e-business Web site, specifically under the discussion of Application patterns for Application Integration:

http://www.ibm.com/developerWorks/patterns/

Figure 4-3 depicts the common three-step process.

Figure 4-3 Population: Multi-step variation

In the Multi-step variation of the pattern, the building block provided by the Population: Single Step application pattern is repeated several times to achieve the desired results. The intermediate target data created by one step acts as the source data for the subsequent step. As shown in Figure 4-3, the application is divided into three logical tiers: extract, transform, and load. In most best-practice implementations, these functional steps contain additional sub-tasks:

The Extract tier extracts data from the source data store. This data store is typically owned by another application and used in a read/write fashion by that application. The extraction rules may range from a simple rule, such as including all data, to a more complex rule prescribing the extraction of only specific fields from specific records under varying conditions.

The Transform tier transforms data from an input to an output structure according to the supplied rules.
Transformation covers a wide variety of activities, including reconciling data from many inputs, transforming data in individual fields based on predefined rules or on the content of other fields, and so on. When two or more inputs are involved, there is generally no guarantee that all inputs will be present when required; the transform step must be able to handle this situation.

The Load tier loads the input data into the target data store. As with extract, load can range from a simple process of overwriting the target data store to a complex process of inserting new records and updating existing records.

The actual implementation can involve a smaller or greater number of steps. In such cases, the diagram in Figure 4-3 must be adjusted accordingly, and consideration must be given to the placement of any additional tiers. It is also important to note that this Application pattern has been generalized to cover any source and target data stores.

Finally, in the new Data Cleansing variation of this pattern, shown in Figure 4-4, the transform and load steps from the Multi-step pattern have been combined into a single data analysis and cleansing stage. This stage does not so much transform the data as validate it and cleanse it of errors. Data is extensively analyzed for such errors, and the resulting data may include calculated or deduced information. Additionally, the resulting data in this variation may in fact be written back to the original source database.

Figure 4-4 Population: Data Cleansing variation

Guidelines for use of these population patterns

It is highly recommended that the logic that governs the transformation of source data into target data (including any transformation or cleansing) be implemented using rules-driven metadata rather than hard-coding these rules. This approach enhances the maintainability of the application and hence reduces the total cost of ownership.

Benefits

This architecture is ideal when data must be transformed between a source and a target data store. This can range from simple transformation of the data (Single Step) to complex transformation (Multi-step and Data Cleansing), as all levels are supported by the variations of this pattern.

Limitations

Most real-world requirements for propagating structured data from one data store to another are complicated. They require extensive reconciliation, transformation, restructuring, and merging of data from multiple sources. Under such circumstances a single-step approach is obviously not advisable, and a multi-step approach should be undertaken. Additionally, reconciling data from multiple sources is often a complex undertaking and requires a considerable amount of effort, time, and resources. This is especially true when different systems use different semantics.
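To make the rules-driven, metadata-based approach recommended above more tangible, the following is a minimal sketch, in Java, of a single-step population application under simplifying assumptions: in-memory maps stand in for the source and target data stores, and the "metadata" is reduced to a small list of field-mapping rules. The names used here (FieldRule, populate, and so on) are purely illustrative and do not correspond to any product API.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the "read dataset - process - write dataset" model.
// The transformation rules are held as data (metadata) rather than hard-coded logic.
public class SingleStepPopulation {

    // One metadata rule: copy sourceField to targetField, applying an optional transform.
    record FieldRule(String sourceField, String targetField, UnaryOperator<String> transform) {}

    // Apply every rule to every source record and return the target records.
    static List<Map<String, String>> populate(List<Map<String, String>> source, List<FieldRule> rules) {
        List<Map<String, String>> target = new ArrayList<>();
        for (Map<String, String> record : source) {
            Map<String, String> out = new HashMap<>();
            for (FieldRule rule : rules) {
                String value = record.get(rule.sourceField());
                if (value != null) {
                    out.put(rule.targetField(), rule.transform().apply(value));
                }
            }
            target.add(out);
        }
        return target;
    }

    public static void main(String[] args) {
        List<Map<String, String>> source = List.of(
            Map.of("cust_name", "smith, j", "cust_tel", "555-0100"));

        // The "metadata": which fields to read, how to modify them, and where to write them.
        List<FieldRule> rules = List.of(
            new FieldRule("cust_name", "customerName", String::toUpperCase),
            new FieldRule("cust_tel", "phone", UnaryOperator.identity()));

        System.out.println(populate(source, rules));
    }
}
```

Because the rules live outside the copying logic, adding or changing a mapping does not require touching the population code itself, which is exactly the maintainability argument made in the guideline above.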
Putting the Application pattern to use

Consider a financial services company that provides various services, including checking accounts, savings accounts, brokerage accounts, insurance, and so on. The company has built this impressive portfolio of services primarily through mergers and acquisitions. As a result, the company has inherited a number of product-specific operational systems. The company would like to create a Business Data Warehouse (BDW) that provides a consolidated view of customer information, and it would like to use this consolidated information for sophisticated pattern analysis and fraud detection purposes. Populating such a BDW would require reconciling customer records from different operational systems that use different identification mechanisms to identify the same customer. Further, other operational systems record transactions with different time dependencies. The reconciliation process must resolve these semantic and time differences and must check for any inconsistencies and irregularities. Due to the complexity involved, the financial services company chooses the Population: Multi-step and Population: Data Cleansing application patterns.

Implementation details

The Population: Single Step application pattern is not documented with additional Runtime patterns or product mappings. Because Single Step functionality is essentially a simplified version of the functionality found in the Population: Multi-step application pattern, the solution designs documented there can be used as a basis for further understanding the implementation details of the Population: Single Step application pattern.

4.2.2 Population: Index Population application pattern

The Population: Index Population application pattern is a new application pattern that combines several prior population application patterns. Specifically, it represents the combination of the previous "Population Crawl and Discovery" and "Population Summarization" application patterns. As the business world's usage and understanding of search solutions has increased, it has become apparent that a single Application pattern more accurately represents the solutions being built today. Thus, this single "Index Population" application pattern has replaced the original multiple search-related Population application patterns that existed. Overall, Index Population provides a structure for applications that retrieve and parse documents and data, and create resulting indices, taxonomies, and other summarizations of the original data. These result sets may include:
- A basic index of relevant documents that match specified selection criteria
- A categorization or clustering of common documents from the original data
- An automatically built taxonomy of the original data, to allow for easy browsing
- Expertise location, automatically mapping the authors of the original data to topics of expertise based on the contents of the documents and the categories discovered

Business and IT drivers

Here, we are concerned with the following business and IT drivers:
- Improve organizational efficiency
- Reduce the latency of business events
- Provide easier access to vast amounts of unstructured data through indexing, categorization, and other advanced forms of summarization
- Provide access to corporate/institutional "tacit" knowledge via identification of experts within the organization (tacit knowledge is the untapped knowledge still within the human mind that has not yet made it into documents and formal data)
- Minimize total cost of ownership
- Maintainability

Overall, the primary business driver for choosing the Index Population application pattern is to provide a more usable and relevant organization of documents or unstructured data, built from a vast set of original documents and based on specified selection criteria.
The objective is to provide quick access to useful information instead of bombarding the user with too much information. Search engines that crawl the World Wide Web or file systems implement this Application pattern, as do the more advanced "discovery" search engines that perform document clustering and categorization, expertise location (that is, identifying experts), and intelligent analysis of the document contents. This pattern is best suited for selecting useful information from a huge collection of unstructured textual data. A variation of this application pattern can be used for working with other forms of unstructured data, such as image, audio, and video files; in such cases additional transformation and translation services are required to parse and analyze the data.

The solution

As shown in Figure 4-5, this Application pattern mainly follows the framework proposed by the Population: Single Step application pattern. However, in the case of this Index Population application pattern, the "Search, Discover, and Indexing" tier crawls through multiple data stores, retrieving documents, parsing them, and building a result set of all documents that match the selection criteria. In some cases, such as World Wide Web search engines, the contents of documents in one data source (that is, URL links) may actually be used to determine additional data sources to crawl.

Figure 4-5 Population: Index Population application pattern

When the unstructured data recovered by these activities must be transformed, cleansed, or manipulated before it can be purposefully used, a Multi-step variant of this application pattern, based on the Population: Multi-step application pattern, might be required. This Multi-step approach is often required for the more advanced search applications that perform document clustering and expertise identification. An initial step will often exist to parse the original data from multiple sources and build a single interim "index" that contains key pieces of document data and metadata. This initial step then allows additional steps to summarize, categorize, create taxonomies, or locate experts from this single normalized index. Additionally, this Application pattern could probably be decomposed even further into two separate Application patterns: one that performs the "search" types of activities, and another that actually populates the index from these searches. However, in this book, we will simplify things by treating the Population: Index Population application pattern as simply another Population pattern; any further decomposition of this pattern will be left to future redbooks.

Guidelines for usage

As discussed earlier in this section, in many cases the pattern will be implemented using a Multi-step approach, utilizing intermediate data stores at each stage. The resulting data may even be distributed across multiple target data stores. For example, an indexing engine that produces a basic document index, summary, and taxonomy may create the index in an initial step, the summary in a second step, and the taxonomy in a final third step, storing the taxonomy in a second target data store to improve performance when accessing the index or walking the taxonomy.
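The following is a minimal, hypothetical Java sketch of the retrieve-parse-index flow just described: documents are read from an in-memory source, parsed into terms, and used to populate a simple inverted index plus a crude keyword-based categorization. It is intended only to illustrate the shape of the pattern; real indexing and discovery engines are far more sophisticated, and none of the names used here correspond to a product API.

```java
import java.util.*;

// Hypothetical sketch of Index Population: retrieve documents, parse them into terms,
// and populate an index (here a simple inverted index) plus a crude category assignment.
public class IndexPopulation {

    // term -> set of document identifiers containing that term
    static Map<String, Set<String>> buildIndex(Map<String, String> documents) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    // Assign each document to a category whose keyword appears in its text.
    static Map<String, String> categorize(Map<String, String> documents, Map<String, String> keywordToCategory) {
        Map<String, String> categories = new HashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            String text = doc.getValue().toLowerCase();
            for (Map.Entry<String, String> rule : keywordToCategory.entrySet()) {
                if (text.contains(rule.getKey())) {
                    categories.put(doc.getKey(), rule.getValue());
                }
            }
        }
        return categories;
    }

    public static void main(String[] args) {
        Map<String, String> documents = Map.of(
            "doc1", "Installing the portal server",
            "doc2", "Troubleshooting search and indexing");

        System.out.println(buildIndex(documents));
        System.out.println(categorize(documents, Map.of("install", "Installation", "search", "Search")));
    }
}
```

In a Multi-step arrangement, the output of buildIndex would play the role of the interim normalized index, with categorization, taxonomy building, or expertise location performed as later steps over that intermediate store.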
Benefits

This is the ideal architecture for extracting useful information from a vast set of unstructured textual data.

Limitations

This Application pattern is geared towards unstructured text document location and search needs. It does not apply to business intelligence types of activities, where the more structured-data-focused Population: Single Step, Multi-step, and Data Cleansing application patterns apply.

Usage scenarios

Consider a large software company with a huge array of software products. The company develops vast amounts of technical documentation to support these products, and each product line publishes its own documentation on its own department Web site. As products change, so does the technical support documentation. Locating a particular piece of information in this sea of ever-changing data can be quite challenging and time consuming. In order to improve the efficiency of information access, the company wants to create a categorized and federated index of all documents that can then be searched or browsed by users as needed to find the required information. Such an index must be refreshed on a periodic basis to keep it current. To meet these requirements, the software company chooses to implement the Population: Index Population application pattern.

4.2.3 Population: Synchronization application pattern

There is one more "data movement" Application Integration pattern that, while it does not directly relate to a portal search solution in all cases, should be referenced here for a complete understanding of this group of patterns. This additional pattern is the Population: Synchronization application pattern, which was previously known as the "Replication" pattern. It enables a coordinated, bidirectional update flow of data in a multi-copy database environment. It is important to highlight the two-way synchronization aspect of this pattern, as it is distinct from the one-way capabilities provided by the Population patterns already discussed.

Business and IT drivers

This Application pattern may be required for geographically dispersed applications using similar database technologies and schemas. It is needed by mobile workers who cannot have direct access to the central repository. Inherent support for synchronization by database products makes this an ideal solution for distributed environments; for homogeneous database environments, synchronization is very straightforward.

The solution

As shown in Figure 4-6, this pattern is a basic two-way synchronization of data between separate data stores. The two variations shown represent the fact that the data may be replicated via a controlling application through an application API, or may be replicated directly via database APIs.

Figure 4-6 Population: Synchronization application pattern

Applications in this solution design do not necessarily have to be identical, but the underlying database schemas should be. Synchronization processing can also be used for one-way propagation by eliminating the feedback process.

Guidelines for usage

Synchronization conflict resolution needs to be incorporated into the exception-processing design of this solution. If the data is updated by both the source and target systems, the synchronization may fail because of timestamp conflicts. If dual control is required for a synchronization solution, then a conflict resolution mechanism must be implemented.
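As an illustration of the conflict-handling guideline above, here is a small, hypothetical Java sketch of two-way synchronization between two copies of a record set, using per-record timestamps. The resolution policy shown (latest update wins, with equal-timestamp conflicts flagged for manual handling) is only one possible choice, and the types and method names are assumptions made for this example, not drawn from any product.

```java
import java.util.*;

// Hypothetical sketch of two-way synchronization between two copies of the same record
// set, using per-record timestamps to detect conflicts. The resolution policy here is
// "latest update wins"; a real solution would route true conflicts to exception handling.
public class Synchronization {

    record Versioned(String value, long timestamp) {}

    // Bring both stores to the same state, with the newest timestamp winning for each key.
    static void synchronize(Map<String, Versioned> storeA, Map<String, Versioned> storeB) {
        Set<String> keys = new HashSet<>(storeA.keySet());
        keys.addAll(storeB.keySet());
        for (String key : keys) {
            Versioned a = storeA.get(key);
            Versioned b = storeB.get(key);
            Versioned winner;
            if (a == null) {
                winner = b;
            } else if (b == null) {
                winner = a;
            } else if (a.timestamp() == b.timestamp() && !a.value().equals(b.value())) {
                // Same timestamp but different values: flag for manual conflict resolution.
                System.out.println("Conflict on key " + key + ", manual resolution required");
                continue;
            } else {
                winner = a.timestamp() >= b.timestamp() ? a : b;
            }
            storeA.put(key, winner);
            storeB.put(key, winner);
        }
    }

    public static void main(String[] args) {
        Map<String, Versioned> central = new HashMap<>(Map.of("item1", new Versioned("price=10", 100L)));
        Map<String, Versioned> mobile = new HashMap<>(Map.of("item1", new Versioned("price=12", 200L)));
        synchronize(central, mobile);
        System.out.println(central + " / " + mobile);
    }
}
```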
Usage scenarios As an example of this Application pattern, consider a mobile worker downloading a work list at the beginning of the day, and uploading updates to this work list at the end of the day. Also, consider the broadcast of product/price information to multiple lines of business, where all the LOBs use the same applications, as an illustration of this Application pattern put to use. 4.2.4 Federation application pattern The Federation application pattern, previously known as the Federated Repository application pattern, creates a unified query interface into isolated structured and unstructured repositories. It is a key component of an overall search solution when many complex and diverse sources must be integrated. Business and IT drivers This pattern is normally selected for the same business drivers as for the “data movement” patterns. That is: Improve organizational efficiency Reduce the latency of business events Distill meaningful information from a vast amount of structured data However, it is the IT drivers that distinguish the selection of this pattern over the more basic “data movement” patterns, as its “connector/adaptor” design allows for: Improved maintainability Minimized total cost of ownership Leverage of existing technology investments Reduced deployment and implementation costs The solution As shown in Figure 4-7, this pattern provides a real-time query interface into both structured and unstructured data. Metadata mapping enables the decomposition of a unified query into requests to each individual repository. The information model appears as one unified virtual repository to users. Using adapters for each target repository, multiple disjoint formats can be integrated into a common federated schema. Figure 4-7 Federation application pattern It is important to note that this pattern is not a user-accessed pattern, but rather represents the method by which another system or application can access integrated data. Benefits This Application pattern is appropriate when an infrastructure for integrating data sources is needed without the need for propagation and/or additional repositories. It can be driven by the need for unified information access by portal projects. It is useful where relational data and text data need to be accessible through one common Web search interface. It is also applicable for structured data-only/business intelligence solutions where the frequency of change of application data would prohibit an Operational Data Store type of solution. Limitations This solution eliminates the duplication of data that exists in other data integration patterns, but it does require metadata mapping during the processing of the federated query. The architecture of this federated real-time query environment needs to be tuned for optimum performance. Usage scenarios As an example of the appropriate use of this Application pattern, consider the situation in which an insurance agent requires access to a customer's policy information, policy document, and pictures from an automobile accident claim. 
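A minimal Java sketch of how such a federated query might be served is shown below, using the insurance scenario as input: metadata identifies the participating repositories, an adapter handles each one, and the results are merged into a single virtual result set for the caller. The adapter and repository names are invented for illustration and do not represent any particular product.

import java.util.*;

// A minimal sketch of the Federation application pattern applied to the insurance
// scenario above. Adapter and repository names are assumptions for illustration only.
public class FederatedQuerySketch {

    interface RepositoryAdapter {
        List<String> query(String customerId);
    }

    public static void main(String[] args) {
        // Metadata: which adapters participate in the federated "customer" view
        Map<String, RepositoryAdapter> adapters = new LinkedHashMap<>();
        adapters.put("policyDatabase", id -> List.of("Policy P-991 for customer " + id));
        adapters.put("documentRepository", id -> List.of("Policy document P-991.pdf"));
        adapters.put("claimImageStore", id -> List.of("accident-photo-1.jpg", "accident-photo-2.jpg"));

        // The unified query is decomposed into one request per repository, and the
        // results are normalized into a single result set for the requesting application.
        String customerId = "C-1234";
        List<String> virtualResult = new ArrayList<>();
        adapters.forEach((name, adapter) ->
                adapter.query(customerId).forEach(item -> virtualResult.add(name + ": " + item)));

        virtualResult.forEach(System.out::println);
    }
}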
A line of business executive could use this Application pattern to view information 56 Patterns: Portal Search Custom Design about one of his key customers using an aggregate of sales data, customer account information, and the latest news from syndicated sources As another example, consider when an customer support agent requires information about a certain product to answer a customers support call. Documents exist in multiple locations, a file system, a knowledge base, a Web site, etc., from which an answer could be found. Additionally, search interfaces to some of these locations have already been created in the past. Rather than requiring the customer support agent to use multiple search engines to locate the needed information, a single “federated search” application could be created that would perform a unified query against all of the existing search engines, and return one set of normalized results — so that the customer support agent can quickly find the relevant information to solve the customer’s problem. 4.3 Information Aggregation patterns As mentioned earlier, the Information Aggregation business pattern, also known as User-to-Data, can be observed in e-business solutions that allow users to access and manipulate data that is aggregated from multiple sources. This Business pattern captures the process of taking large volumes of data, text, images, video, and so on, and using tools to extract useful information from them. There are two key applications patterns in this area that we will discuss: User Information Access (UIA) User Search and Discovery (US&D) Overall, these patterns represent similar functionality, with the User Information Access applying to primarily structured data/business intelligence types of applications, and the User Search and Discovery pattern applying to unstructured data/knowledge search applications. However, both of these patterns will be discussed in this section. 4.3.1 User Information Access application pattern The User Information Access application pattern, previously known as the Information Access application pattern, helps structure a system design that provides access to aggregated information. It is most often used in conjunction with one of the “data movement” Application Integration patterns already discussed, to provide a data access interface for users into an aggregated repository created by these data movement patterns. For the more data oriented applications to which this pattern applies, this might also be called a “query”. Chapter 4. Application patterns 57 There are two forms of this pattern that we will discuss. First is a basic variation, in which users view but cannot update information. The second variation is very simliar to the basic read-only variation, except that the data sources may be updated. Business and IT drivers Here, we are concerned with the following business and IT drivers: Improve organizational efficiency Reduce the latency of business events Provide access to distilled information and drill-through capability Minimize total cost of ownership (TCO) Promote consistency of Operational Data Maintainability The primary business driver for choosing this Application pattern is to provide efficient access to information that has been aggregated from multiple sources. This mechanism can access both structured and unstructured data populated by one or more of the “data movement” application patterns. Internal and/or external users may use this information for decision-making purposes. 
For example, an Executive Information System (EIS) might generate a summary report on a periodic basis that compares the sales performance of various divisions of a company with the sales targets of those divisions. In this example, the User Information Access application pattern is used for accessing information from structured raw data. In addition, the application may provide drill-through capability allowing the user to track the performance of individual sales representatives against their individual targets. The basic read only solution Figure 4-8 shows a diagram of the User Information Access application pattern. Metadata App N read Pres. Data Integration Method App 2 read Figure 4-8 User Information Access application pattern 58 Patterns: Portal Search Custom Design As shown in Figure 4-8, the basic User Information Access application pattern is broken into three logical tiers: The Presentation Tier is responsible for all the user interface related logic that includes data formatting and screen navigation. In some cases the presentation might be as simple as a printout. The Data Integration Tier is responsible for accessing the associated read only data store and distilling the required information from this data. An additional “drill-through” Application Tier is sometimes provided in this pattern to allow the ability to drill-through to detailed data. However, drill-though capability is not always required so is not depicted in the pattern diagram. Data necessary to enable drill-through might already exist and be accessed from an existing information access application, or might be defined in the scope of this information access application. Additionally, the box around one of the target sources represents the fact that the target data may need to be access via a controlling application’s application API, or may be accessed directly via a database API. Two variations Patterns that include update facilities are needed because many data oriented systems that provide query and reporting also need to accommodate changes provided by the end users. For example, a user analyzing financial results may also wish to include their own budgetary figures. Thus, a very minor change to this pattern is needed to show the “read/write” capabilities of the access to the data sources. One read/write variation is shown in Figure 4-9. Metadata Read/write Pres. App N Data Integration Method App 2 read/write Figure 4-9 User Information Access application pattern: immediate update There are actually two main methods in which an update can be supported – immediate or batched. Chapter 4. Application patterns 59 In an “immediate” variation, any user updates are immediately applied to the source data. However, when multiple source applications are involved, propagating changes back to the source applications can be extremely complex, possibly even requiring two-way synchronization of data. Such an “immediate” approach is depicted in Figure 4-9 on page 59. Note that in some cases it may make sense to provide an additional “update” node to handle the update processing, especially if it is complex. A “batched” variation may also make sense in some cases in which updates are not as time critical. Such an approach may involve an additional update “staging” repository to hold any updates made by the user. Such updates would then be applied on some set schedule, as a “batch” process. 
Similar to the immediate approach, this update process may be implemented via an additional “update” node — or, even handled by a “population” tier that is external to this pattern. The choice of a batched or immediate variation will usually be based on various IT drivers in the environment. There may be existing population methods already in existence and ready to be leveraged for the “batch” updating. Alternatively, the business controls and rules of an enterprise may require all updates to come from a single controlling application. The introduction of an additional update source, as in the “immediate update” variation of this pattern, may introduce data integrity and reliability issues. Guidelines for use A clear separation of the presentation logic and the information access logic increases the maintainability of the application and decreases the total cost of ownership. This allows the same information to be accessed using various user interfaces. Benefits For the basic pattern, the use of read-only data provides for maximum consistency in a multi-user analysis or reporting environment. This simple yet powerful Application pattern meets the majority of the information aggregation and distillation needs. The simplicity of this pattern reduces implementation risk. Limitations In some cases, the data sources utilized in this pattern will be single consolidated data sources (that is, ODS) taken from multiple original datasets. In such cases, any updates to the consolidated data would not be propagated back to the originating data/source applications. One option is to have updates occur via “drill through” capabilities, in which users drill through to the original data and make updates via original source application capabilities. Updates then occur to the original data; however, a time delay could occur until these updates are available in the consolidated data that is accessed by data queries. 60 Patterns: Portal Search Custom Design Putting the Application pattern to use Consider a Personal Portal such as my.yahoo.com that aggregates information from disparate data sources and allows users to personalize this information to meet their preferences. These portals aggregate both structured data such as weather information and stock quotes and unstructured data such as news and links to other sources of information. Based on the type of the data and the amount of transformation required, the portal developers may choose one or more of the Population application patterns. Once the data has been stored in the optimal format, the portal developers may offer the User Information Access application pattern to search this information, access this information in a personalized style, and/or to provide drill-through capabilities. 4.3.2 User Search and Discovery application pattern The User Search and Discovery pattern is very similar to the User Information Access pattern. However, as previously mentioned, while User Information Access applies primarily to structured data/business intelligence types of applications, the User Search and Discovery pattern normally applies to unstructured data/knowledge search applications. This search-focused User Search and Discovery application pattern is shown in Figure 4-10. Metadata Retrieve & Parse information Pres. 
Figure 4-10 User Search and Discovery application pattern With a quick glance, one can easily see that this pattern is almost a duplicate of the User Information Access application pattern discussed earlier. The only real change is that the previous “data integration” tier has been replaced by a “search” tier. This is done to highlight the “search” aspects of this pattern versus the normal relational “query” aspects of the User Information Access pattern. Overall, the search tier supports a unified query mapping across multiple data indices. Both metadata and query syntax mapping enable the decomposition of such a unified query into requests understood by each individual data index. This syntax mapping is an important aspect, as search concepts do not yet have a unified query language such as the SQL used in normal data querying applications. In simple cases only one data index is searched, while in other cases multiple indices are searched, and the search tier takes on “brokering” capabilities. That is, the search tier would “broker” search requests to the data indices, and result sets are then combined and normalized by the search tier’s “brokering” capabilities. In fact, in such “brokered” cases the actual “connection” to the data may be handled by search “adapters” or connectors for each target repository. These “adapters” then contain the logic to access the data, perform the search, and then send the results back to the “search” tier, where they can be consolidated and combined with other result sets. However, this is really getting into Runtime pattern types of details, so we will leave further discussion on the search “adapters” to 5.5.1, “User Search and Discovery Runtime pattern” on page 86. A final aspect to consider, when comparing this pattern to the User Information Access pattern, is the data sources themselves. While in the User Information Access pattern real data may be accessed directly via normal relational database/SQL queries, with the User Search and Discovery pattern it is normally data indices that are searched — not the actual data itself. That is, consolidated indices of data are built by population methods, which in turn index multiple other data sources — be they files on a file system, HTML pages on the Internet, or even “blob” data from a relational repository. Thus, the actual type of data being “queried” is another distinguishing feature of this pattern over the User Information Access pattern. Drivers and guidelines for usage In general, the same business and IT drivers, and guidelines for usage, apply to this pattern as applied to the User Information Access pattern. However, when a “brokered” approach is utilized, additional IT drivers are often at play in bringing about this design. Specifically, the goals of “leveraging existing technology investments” and “reducing deployment and implementation costs” can be met with this approach. The usage of “adapters” can allow one to better leverage existing legacy interfaces to the data sources — reducing deployment costs. Additionally, should a data source need to be removed or replaced, only this adapter/connector logic would need to change, not the entire search application. Of course, this is again getting into Runtime pattern types of details, so we will again leave further discussion regarding search “adapters” to 5.5.1, “User Search and Discovery Runtime pattern” on page 86. 
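The following minimal Java sketch illustrates the query syntax mapping idea described above: one unified query is rewritten into the different syntaxes assumed to be understood by each data index before the search tier brokers the requests. Both target syntaxes are invented examples rather than real product query languages.

import java.util.*;
import java.util.function.Function;

// A minimal sketch of query syntax mapping in the search tier. The index names and
// target query syntaxes are invented examples, not real product query languages.
public class QuerySyntaxMapping {

    public static void main(String[] args) {
        String unifiedQuery = "portal AND search";

        // Per-index translators keyed by index name (metadata the search tier would hold)
        Map<String, Function<String, String>> translators = new LinkedHashMap<>();
        translators.put("webCrawlIndex", q -> q.replace(" AND ", " +"));   // "+term" style syntax
        translators.put("documentIndex",
                q -> "<query><and>" + q.replace(" AND ", "</and><and>") + "</and></query>"); // XML style syntax

        // Broker the rewritten query to each index in the form it understands
        for (Map.Entry<String, Function<String, String>> t : translators.entrySet()) {
            System.out.println(t.getKey() + " receives: " + t.getValue().apply(unifiedQuery));
        }
    }
}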
62 Patterns: Portal Search Custom Design 4.3.3 Self-Service application patterns compared At the outset, one may find architectural similarities between the User Information Access/User Search and Discovery application patterns and applications that automate the Self-Service business patterns. It is important to distinguish the differences as we close our detailed discussion of the Information Aggregation patterns. What actually distinguishes these two patterns is the user interaction with data (User Information Access) versus a business transaction (Self-Service). The User Information Access application patterns facilitate direct interaction between users and data, hence providing significant freedom and flexibility in accessing and manipulating data. Applications that automate the Self-Service business pattern enable direct interaction between users and business transactions, thus enabling users to electronically perform a business process. Data may be involved in these business transactions, but only smaller snippets of data. However, a few of the Application patterns for Self-Service can be used as front-ends to data stores, thus providing basic information access capabilities as well — again, for a smaller scope and size of data than supported by the Information Aggregation patterns. 4.4 Combining the patterns for search solutions The basic relationship of Application Integration and Information Aggregation patterns was discussed earlier in this chapter in Section 4.1, “An overview of the Application patterns” on page 44. Now that these patterns have been described in additional detail, we can more closely examine this relationship in regard to search solutions. In general, it is the Application Integration patterns that allow the Information Access patterns to search data sets. However, there are multiple ways in which this can be allowed, as shown in the multiple Application patterns we have discussed. For example, taken at the very basic level, a User Search and Discovery based application could have the ability to directly search a data source on its own, without required a population/indexing step, as depicted in Figure 4-11. Chapter 4. Application patterns 63 User "User Search & Discovery" based application Original Data Figure 4-11 A basic search solution More commonly, applications based on one of the Population application patterns, such as the Index population pattern, are utilized to combine multiple data sources into a single search “index”. A User Search and Discovery based application would then be created to provide the user interface, and perform the actual search operations against this “index” of data. This “index” may be a more advanced categorized and summarized index available via a taxonomy — but the same model would apply. Population applications populate this data, and User Search and Discovery based applications then allow users to search. This is depicted in Figure 4-12. User "User Search & Discovery" based application Searchable Data Index "Population" based application Original Data Figure 4-12 A single index search solution This basic model can then be expanded. If the User Search and Discovery based application was built with the optional “brokering” capabilities, then multiple indices could actually be searched and bring the user an aggregated and normalized result set. This is depicted in Figure 4-13. 
64 Patterns: Portal Search Custom Design "Population" based application User Searchable Data Index "User Search & Discovery" based application Original Data "Population" based application Searchable Data Index Original Data Figure 4-13 A brokered search solution To take this a step further, the capabilities of the Federation application pattern could then be introduced to allow multiple data sources to be searched without populating a single consolidated index. Moreover, Federation capabilities could be utilized by a Population application to aid it in creating its index. This introduction of Federation capabilities is depicted in Figure 4-14. "Population" based application User "User Search & Discovery" based application Searchable Data Index "Federation" based application "Federation" based application Original Data Original Data Original Data Original Data Figure 4-14 A brokered and federated search solution Chapter 4. Application patterns 65 As shown in these various solutions, the clear separation of information access logic and population logic allows for the same information to be packaged differently based on the needs of the business problem at hand. This layered approach decreases the total cost of ownership, and increases solution flexibility. 4.5 Summary In this chapter we have introduced the Application patterns that are required for common portal search solution needs, and are part of the Portal Search custom design. As with the portal composite pattern as a whole, a single application pattern cannot accurately define the problem. Rather, to extend the portal for search capabilities, Information Aggregation application patterns (User Search and Discovery) must be introduced to provide the search interface and logic capabilities; while Application Integration patterns (Population and Federation) are required to prepare and integrate multiple data sources for searching. The addition of these patterns to the base Portal composite pattern, creates the Portal Search custom design. In the following chapter, we will describe the common Runtime patterns and Product mappings that take these higher level Application patterns down to the technology level. 66 Patterns: Portal Search Custom Design 5 Chapter 5. Runtime patterns After choosing the appropriate Business pattern and Application pattern, it is time to define the Runtime pattern and map the products used to implement it. Runtime patterns define functional nodes (logical) that underpin an Application pattern. The Application pattern exists as an abstract representation of application functions, whereas the Runtime pattern is a middleware representation of the functions that must be performed, the network structure to be used, and the systems management features, such as load balancing and security. In reality, these functions, or nodes, can exist on separate physical machines or may co-exist on the same machine. In the Runtime pattern, this is not relevant. The focus is on the logical nodes required and their placement in the overall network structure. This chapter introduces the Runtime patterns for our Portal Search custom design. © Copyright IBM Corp. 2004. All rights reserved. 67 5.1 Runtime node descriptions A Runtime pattern is represented by logical nodes, where each node has a specific role in the architecture. It defines the topology of the architecture and node placement. Most patterns will consist of a core set of common nodes, with the addition of one or more nodes unique to that pattern. 
To understand the Runtime patterns presented in this book, you will need to review the following common node definitions. Public Key Infrastructure (PKI) node PKI is a collection of standards-based technologies and commercial services to support the secure interaction of two unrelated entities (for example, a public user and a corporation) over the Internet. In the context of the topologies defined in this redbook, PKI supports the authentication of the server to the browser client, using the SSL protocol. Domain Name Server (DNS) node The DNS node assists in determining the physical network address associated with the symbolic address (URL) of the requested information. The DNS is that of the Internet Service Provider, although for additional security and more efficient use of network resources, the hosting environment where the portal implementation is housed can leverage its own DNS node. User / Internal User node This node is most frequently a personal computing device (PC, etc.) supporting a commercial browser, for example, Netscape Navigator or Internet Explorer. The level of the browser is expected to support SSL and some level of DHTML. Increasingly, designers should also consider that this node may be a pervasive computing device, such as a personal digital assistant (PDA). The internal user accesses the system from within the client’s network and/or via a VPN connection from the Internet. This user will also use some type of desktop or mobile based computing device. Directory and Security Services node This node supplies information on the location, capabilities and various attributes (including user ID/password pairs and certificates) of resources and users known to this Web application system. The node may supply information for various security services (authentication and authorization) and may also perform the actual security processing, for example, to verify certificates. The authentication in most current designs validates the access to the Web application server part of the Web server, but it can also authenticate for access to the database server. 68 Patterns: Portal Search Custom Design Protocol and Domain Firewall node Firewalls provide services that can be used to control access from a less trusted network to a more trusted network. Traditional implementations of firewall services include: Screening routers (protocol firewall) Application gateways (domain firewall) The two firewall nodes provide increasing levels of protection at the expense of increasing computing resource requirements. The protocol firewall is typically implemented as an IP router, and the domain firewall is a dedicated server node. The protocol firewall prevents unauthorized access to servers in the DMZ from the outside world by filtering incoming requests by protocol, access route, point of origin, and other data characteristics. The domain firewall prevents unauthorized access to servers on the internal network by limiting incoming requests to a tightly controlled list of trusted servers in the DMZ. Web Server Redirector node In order to separate the Web server from the application server, a Web Server Redirector node (or just redirector for short) is introduced. The Web server redirector is used in conjunction with a Web server. The Web server serves HTTP pages and the redirector forwards servlet and JSP requests to the application servers. 
The advantage of using a redirector is that you can move the application server behind the domain firewall into the secure network, where it is more protected than within the demilitarized zone (DMZ). Static pages can be served from the DMZ by this node. The Portal composite Runtime pattern supports this, since the ability to add additional Web servers and/or additional application servers, without affecting other portal nodes, is important for supporting scalability in the system and enhancing maintainability. The redirector can be implemented, for example, by either a reverse proxy server or by a Web server plug-in. Application Server node The Application Server node provides the execution and communication runtime environment for the business logic of the application. The business logic may be self-contained on the application server node. If not, the application server node is responsible for interacting with back-end applications and retrieving data from back-end data sources. The application server node typically enables infrastructure services such as persistence, resource connection pooling, scalability, failover, administration, and support for Java. Chapter 5. Runtime patterns 69 The application server node is often the central mechanism in the systems architecture to provide access to various back-end data sources and/or applications and to provide this access to mechanisms that are used to present data to the end-user. Presentation Server node The Presentation Server node provides services to enable a unified user interface. It is responsible for all presentation-related activity. In its simplest form, it serves HTML pages and runs servlets and JSPs. For more advanced patterns, it acts as a portal and provides the access integration services (single sign-on, for example). It interacts with the Personalization Server node to customize the presentation based on the individual user preferences or on the user role. The Presentation Server node manages the presentation of data extracted from multiple sources. Through the use of user profile information, business rules (personalization) and a mechanism for aggregating different information sources (static editorial data, content managed data, data from remote systems), an aggregated view of data can be displayed. This aggregated view can be tailored for different device types based on information known about the current user accessing the portal. Personalization Server (Rules Engine) node The Personalization Server node works with the Presentation Server node to customize the presentation with data that matches a user’s interest. The Personalization server identifies the type or class of the user based on information available about the user. Based on this classification, data taken from a content data store either in the Personalization tier or from back-end sources is selected for presentation to the user. It provides the mapping function of user classification to content data. The Personalization server contains the “rules” that determine what types of user’s can have access to certain type of information. These are also referred to as access control rules and are directly related to business rules and processes. This is referred to as the Personalized Delivery::Prescriptive Runtime pattern. The Personalization server also allows the user to design the content and the layout of the content that they see by explicitly choosing from a selection of options. 
This is referred to as the Personalized Delivery::Participator Runtime pattern. You can use either or both of these patterns for the Portal composite pattern. Collaboration node The Collaboration node provides synchronous and asynchronous modes of communicating between an organization. We call this a community. Community is empowered by collaboration, collaborative work between users. The 70 Patterns: Portal Search Custom Design Collaboration node provides interactive discussions (interactive messaging and chat functionality) and the sharing of documents/ideas (teamroom environment). Content Management node The Content Management node provides for the management of digital assets (for example, images, documents, “pieces” of text) and applies a workflow and security rules (for example, access control) to each discrete asset. Note that assets can also be referred to as “resources” (as they are in WebSphere Content Publisher). The Content Management node will commonly include and/or leverage these functions: Content Type / Category Identification Workflow (based on a user’s role and/or the type of content) Versioning (including rollback to previous versions) Handling of static or dynamic content Transcoding / reformatting of content (more recently added to handle multiple end-user channel device types) Storage of content to multiple data source types (for example, DBMS, file system) Search and Indexing node A Search and Indexing node provides a function to catalog and/or index the content data sources. This will provide the capabilities to locate specific content (for example, product or catalog information) and to update this search capability when updates are added (via indexing). In addition, this information can be “indexed” in a manner that provides the Presentation and Personalization server an ability to find information that is “associated” to the actions taken by the end-user. For example, this could provide for “cross-selling” or “up-selling” on a commerce site, which is a specific form of Implicit Personalization. For more details, refer to the “Predictive Personalization” Runtime pattern at: http://www-106.ibm.com/developerworks/patterns/access/at3-runtime.html Database Server node The Database Server node provides a persistent data storage and retrieval service in support of transactional interactions. The data stored is relevant to the specific business interaction, for example, bank balance, insurance information, or current purchase by the user. This node represents a common mechanism to manage all database management system-based data sources. Chapter 5. Runtime patterns 71 Pervasive User node A Pervasive User node is a “catch all” category of portal user that is all “mobile” (non-desktop) connected end-user devices other than a Web browser. In most current scenarios this includes devices such as mobile phones, personal digital assistants, and text pagers. Wireless Gateway node This node serves the information from the portal via alternative protocols to wireless devices 5.2 Runtime pattern for the Portal composite pattern The overall Portal composite Runtime pattern is represented in Figure 5-1. It contains nodes that are common in a portal implementation. However, note that based on specific business drivers, your implementation of a portal will include some or all of these nodes. In fact, it may contain nodes that are not included in this diagram and are specific to the implementation. 
The Portal composite Runtime pattern combines the characteristics of many different Runtime patterns. Through this combination of characteristics a composite picture of those “components” that are generally implemented for, and add value to, a portal implementation become clear. Below is a list of the characteristics that generally make up Runtime patterns and the additional nodes that are included for the Portal composite pattern. Familiar functionality is included in this pattern, such as: Databases and data sources A data integration and business logic mechanism, via an application server A security and user directory system Browser based access In addition, there are these portal specific nodes and functions: Content Management Collaboration (the synchronous chat and messaging form) Workflow (part of the application server and/or Content Management) A Presentation Server (generates properly formatted output) A business rules (Personalization) engine Search and Indexing Multi-client device and formats supported (for example, Wireless gateway) A portal implementation leverages the concept of personalization, multi-device type access, a presentation rendering mechanism, and a business rules engine. These are combined with the ability to search and index content (of various types 72 Patterns: Portal Search Custom Design and formats), provide collaboration, and manage content via a workflow to provide both content aggregation and a collaborative environment. The Portal composite Runtime pattern represents a starting point for most portal implementations, providing a way to identify those functional areas that will likely need to be addressed when considering this type of implementation. Consequently, the Portal composite Runtime pattern shown in Figure 5-1 represents a preliminary step towards an operational architecture that can be implemented in a target environment to provide secure data aggregation, multi-client access and collaboration. Outside world DMZ Internal network Directory and Security Services I N T E R N E T Domain Name Server Pervasive Device Collaboration Personalization Server Web Server Redirector Domain Firewall Browser User Search & Indexing Protocol Firewall Public Key Infrastructure Database Server Content Management Presentation Server Application Server Wireless Gateway Figure 5-1 The basic portal composite Runtime pattern 5.3 Runtime pattern for Portal Search custom design Now that we have defined the Portal composite pattern’s overall Runtime pattern, it is important to now show the differences for the Portal Search custom design. This “custom design variation” of the overall Runtime pattern is depicted in Figure 5-2. The main differences between these two Runtime patterns center on the searching and indexing node. The basic Portal composite pattern included a single node defined as “searching and indexing”. However, this searching and indexing capability was really focused on data included within the portal itself — as a companion to the Content Management node, and consequently the content Chapter 5. Runtime patterns 73 management capabilities available within the portal. When we bring this Runtime pattern to the level of a more comprehensive portal search solution, this searching and index capability needs to be separated from the realm of content management and portal specific data — to allow it to handle diverse and external data sources. 
Furthermore, as described in the Application patterns discussion earlier in this redbook, there is a need to separate the Information Aggregation (that is, user information access) and data focused Application Integration (that is, “federation” and “population”) aspects of search into separate nodes. As these concepts represent distinctly different application and performance needs, they should therefore be considered separately in the runtime model. Thus, separate “search”, “population”, and “federation” nodes are defined within this Custom design runtime model, as shown in Figure 5-2. Outside world DMZ Internal network Population Domain Name Server Pervasive Device Personalization Server Web Server Redirector Domain Firewall Browser User I N T E R N E T Protocol Firewall Public Key Infrastructure Directory and Security Services Federation Database Server Search Presentation Server Application Server Wireless Gateway Collaboration Content Management Figure 5-2 The Portal Search custom design Runtime pattern Within this overall Custom design Runtime pattern, the key nodes that provide for the search capabilities are the Database Server node, the Presentation Server node, the Search node, the Federation node, and the Population node. These nodes interact as follows: The Database server node provides the persistent data storage and retrieval services as required by all other nodes in the Runtime pattern. The Federation node provides for a unified interface into multiple isolated data sources. It is a key component of an overall search solution when many complex and diverse sources must be integrated. 74 Patterns: Portal Search Custom Design This node represents the high level runtime implementation of the Application Integration::Federation application pattern. The Population node provides the functionality to communicate with raw data sources to perform the indexing, and optional categorization/summarization processing, required to produce searchable data indices of this original data. The Population node may access multiple data sources directly, or may access a single “virtual” data source as presented by the Federation node, or any combination of these two. This node represents the high level runtime implementation of the Application Integration::Population: Index Population application pattern. The Presentation node is the same node utilized in the generic Portal Composite pattern to enable a unified user interface. In terms of search capabilities, this would involve communications with the Search node to pass along user requests, receive search responses, and potentially format the search response. Depending on the search technologies utilized, the search responses may be sent back to the presentation node in an already formatted manner (that is, full HTML), or may be passed in a manner that requires the presentation node to transform the data (that is, XML) into the appropriate formats such that a single aggregated view of all portal content can be presented to the user. The Search node provides the core engine that communicates with the various search indices and data sources, and performs the actual searches based on users requests. In a portal solution, the portals “presentation server” node performs the user interface with the end user, although it will usually leverage various UI capabilities of the search node (via product APIs). This search node will also provide any extended search capabilities in the solution. 
This includes search brokering and aggregation to provide for a single unified search result across multiple disparate data sources. The combined Presentation and Search nodes represent the high-level runtime implementation of the Information Aggregation::User Search and Discovery application pattern. In the next few sections of this chapter we will look at the specific Runtime patterns for each of these application patterns in more detail, examining more closely how the key Portal Search custom design application patterns, identified in Chapter 4, “Application patterns” on page 43, map into this overall solution Runtime pattern. For details on the general portal aspects of this Runtime pattern, please refer to these IBM Redbooks: Patterns: A Portal composite pattern using WebSphere Portal V4.1.2, SG24-6869 Patterns: A Portal composite pattern using WebSphere Portal V5, SG24-6087 5.4 Application Integration Runtime patterns As discussed earlier in this redbook, the following Application Integration patterns are important for the understanding of a search solution: Data movement application patterns: – Population: Single Step • Population: Multi-step variation • Population: Data Cleansing variation – Population: Index Population – Population: Synchronization Federated access application patterns: – Federation As also discussed, it is primarily the “Population: Index Population” and “Federation” application patterns that come into play in a given search solution. Thus, these patterns are the ones described to the runtime level in this chapter. However, some of the other Population patterns (Multi-step, Synchronization, and Data Cleansing) are also discussed in their relation to various compositions of the Population: Index Population and Federation Runtime patterns. 5.4.1 Population: Index Population Runtime pattern Figure 5-3 depicts the basic Runtime pattern for applications implementing the Population: Index Population application pattern. Figure 5-3 Application Integration::Population: Index Population - Runtime pattern This population Runtime pattern is in fact the same Runtime pattern that would be used to represent many of the other population Runtime patterns as well. For example, the “Population: Single Step” pattern would follow this same model — as would the “Population: Data Cleansing” variation. In the case of the Data Cleansing variation the population node would perform the cleansing activities, while in this index pattern the population node performs indexing activities. The same usage model therefore applies to any overall system design relying on population capabilities. In the case of this pattern’s application to a “Population: Index Population” pattern, the population activity taking place in this Runtime pattern is as follows: The overall Population application “indexing and summarization activities” begin processing data from the original data sources. Data sources are usually processed one at a time, in a sequential order. As the data is being processed, the usage of a temporary work queue may be required. This work queue data can be stored in files, message queues, or databases. The final index of all documents, including document metadata and summarizations, is written to a resulting index data file. 
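A minimal Java sketch of this population activity is shown below, under assumed source names and file locations: sources are crawled one at a time, parsed documents pass through a temporary work queue, and the final metadata and summaries are written to an index data file.

import java.nio.file.*;
import java.util.*;

// A minimal sketch of the population activity just described: sequential source
// processing, a temporary work queue, and a resulting index data file.
// The in-memory "sources" and the file location are illustrative assumptions.
public class PopulationRuntimeFlow {

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> sources = new LinkedHashMap<>();
        sources.put("departmentSiteA", List.of("Product A install guide", "Product A tuning notes"));
        sources.put("departmentSiteB", List.of("Product B troubleshooting"));

        Deque<String> workQueue = new ArrayDeque<>();   // could equally be files, message queues, or a database
        List<String> indexRecords = new ArrayList<>();

        // Sources are crawled sequentially; parsed documents land on the work queue
        for (Map.Entry<String, List<String>> source : sources.entrySet()) {
            for (String doc : source.getValue()) {
                workQueue.add(source.getKey() + "|" + doc);
            }
            // Indexing drains the queue before the next source is started
            while (!workQueue.isEmpty()) {
                String[] parts = workQueue.remove().split("\\|", 2);
                indexRecords.add(String.format("source=%s summary=%.15s length=%d",
                        parts[0], parts[1], parts[1].length()));
            }
        }

        Path indexFile = Files.createTempFile("document-index", ".txt");
        Files.write(indexFile, indexRecords);
        System.out.println("Wrote " + indexRecords.size() + " index records to " + indexFile);
    }
}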
Advanced categorization When advanced categorization and taxonomy generation capabilities are required, a slightly more advanced variation of this Runtime pattern may be needed. In this variation, the “Population: Index Population” needs are fulfilled in a “Population: Multi-step” type of approach, as shown in Figure 5-4. Figure 5-4 Application Integration::Population: Multi-step (applied to indexing) By examining the Runtime pattern, one can see that the categorization capabilities have been split out into a separate runtime node. The Categorization node takes the indexed and summarized search data, and performs the categorization processing to produce a document taxonomy including indexed and categorized data. This runtime design allows for more flexibility in terms of scaling the overall system to support the indexing/processing of larger data sources, and to produce larger resulting search indices. However, there is one other manner in which the need for a more scalable “Population: Index Population” can be handled. This alternative approach involves the combination of multiple Population patterns into one overall runtime solution, as shown in Figure 5-5. Figure 5-5 Multiple “Population: Index Population” patterns combined As shown in the diagram, multiple “Index Population” patterns have been combined into a single solution. This solution allows for the same scaling flexibility as the “Population: Multi-step” variation discussed earlier. However, this approach also allows for the separation of the document index data from the document categorization data. Such a separation of the resulting index and categorized data allows for a separation of the database load between those users performing searches and those users walking the taxonomy of categorized documents. Overall, this variation (while still simplistic) represents a highly scalable enterprise solution, with maximum flexibility in terms of horizontal scaling. External data To take this discussion of combining patterns at the runtime level one step further, let us examine a situation involving the need for external data to be included within a population routine. In such a case, one option would be for the core “population node” processing to continue to function on an internal network, accessing external data through the enterprise firewall. This approach is shown in Figure 5-6. Figure 5-6 “Index Population” applied to external data In such cases, careful consideration must be given to the security and encryption of all communication to the external data sources, so that data security and integrity are not compromised as the external data flows across the unsecured external network. 
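As one simple illustration of such precautions, the following Java sketch has the population routine refuse any external source that is not reachable over an encrypted HTTPS connection before retrieving it for indexing. The URLs are placeholders, and the example assumes network access is available when it runs.

import java.net.URI;
import java.net.http.*;

// A minimal sketch of one precaution for indexing external data: only retrieve external
// sources over encrypted HTTPS connections (server certificates are validated by default),
// refusing plain-HTTP sources. The example URLs are placeholder assumptions.
public class SecureExternalFetch {

    public static void main(String[] args) throws Exception {
        String[] externalSources = {"https://example.com/docs/index.html", "http://example.com/insecure.html"};
        HttpClient client = HttpClient.newHttpClient();

        for (String source : externalSources) {
            URI uri = URI.create(source);
            if (!"https".equalsIgnoreCase(uri.getScheme())) {
                System.out.println("Skipping unencrypted source: " + source);
                continue;
            }
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Fetched " + source + " (" + response.body().length() + " chars) for indexing");
        }
    }
}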
Data replication Another way to handle the external data at a runtime level would be to implement a “replication” routine that creates an internal “copy” of the external data. In some cases, this replication node would be placed within the “DMZ” layer of the network, to allow for a more secure separation of the network traffic. This approach can be taken such that the internal “replica” of the data is a one-way “pull” of data only, in which case this solution might really be the combination of the “Population: Single Step” and “Population: Index Population” patterns. Alternatively, the internal replica of the data could be a full two-way synchronization, in which updates to the internal copy of this data are allowed and then synchronized back out to the original external copy. In this case, the approach, depicted in Figure 5-7, is actually a combination of the “Population: Synchronization” and “Population: Index Population” patterns. Figure 5-7 “Index Population” applied to external data via “synchronization” This solution does have an advantage in that it may minimize the network traffic of the system, as the traffic associated with data lookups for an indexing process would be more intensive than the simple replication/synchronization traffic. However, this runtime solution has its disadvantages as well. For example, the security concerns would be similar to a solution without replication. One would still need to ensure that the communication of data from the external network is secure during the replication/synchronization process. Additionally, keeping the schedule of synchronization updates matched up with the schedule for re-indexing of the data may also become problematic — resulting in slower updates to the data indices, and more delayed search results to end users. This approach would also not be feasible when the external data is owned by an external entity. In such cases, the external entity will probably be hesitant to allow a full copy of their data to be kept in another location, where they would lose all control of the data security. 
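The scheduling concern can be illustrated with a small worked example: if the replica is refreshed every six hours and re-indexing runs every twelve, an external update made at the wrong moment may not be searchable for roughly the sum of the two intervals. The interval values in the following Java sketch are illustrative assumptions.

import java.time.Duration;

// A minimal sketch of the schedule-coupling concern noted above. The interval values
// are illustrative assumptions, not recommendations.
public class IndexFreshnessLag {

    public static void main(String[] args) {
        Duration syncInterval = Duration.ofHours(6);    // replication/synchronization schedule
        Duration indexInterval = Duration.ofHours(12);  // re-indexing schedule

        // Worst case: the change lands just after a sync run, and the replicated copy
        // then lands just after an indexing run.
        Duration worstCaseLag = syncInterval.plus(indexInterval);
        System.out.println("Worst-case delay before an external update is searchable: "
                + worstCaseLag.toHours() + " hours");
    }
}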
The data extraction could be a simple data transformation process, which might be implemented in a “Population: Multi-step” fashion. Alternatively, the data extraction could require more advanced cleansing of errors and other issues, in which case the solution would be a combination of the “Population: Single Step” and “Population: Data Cleansing” patterns as depicted in Figure 5-8. Runtime view Data Server / Services Data Server / Services Population Indexing Population Data Server / Services Data Cleansing Application pattern Figure 5-8 “Index Population” combined with “Population: Data Cleansing” 82 Patterns: Portal Search Custom Design Data Server / Services 5.4.2 Federation Runtime pattern The Federation application pattern was discussed earlier in this redbook, within 4.2.4, “Federation application pattern” on page 55. It is designed to create a unified query interface into isolated structured and unstructured repositories. Figure 5-9 depicts the basic Runtime pattern for applications implementing the Federation application pattern. Runtime pattern Data Server / Services Data Server / Services Data Integration Data Server / Services Metadata Application pattern App N read App 1 Data Integration Method App 2 read/write Figure 5-9 Application Integration::Federation — Runtime pattern The population activity taking place in this Runtime pattern is as follows: A requesting application makes a query of data from the “federated” data source — for example, a simple SQL Select request. The data integration node processes the request; and utilizing its metadata, which defines the data sources, it passes on the requests to the appropriate data sources. In many cases, the data integration/federation logic within the Data Integration node may be logically separate from “data connector” logic. This data connector logic spreads out the load of making the query to multiple data sources — allowing the queries to run simultaneously against each database. When performance is of major concern, multiple logical data connectors may exist to process queries against a single data source — the idea here being to eliminate any single node in the process from becoming a bottleneck, if too many requests run against one data source. Chapter 5. Runtime patterns 83 In all cases, the results that are returned from each individual data source must then be aggregated and normalized by the data integration layer so that these results appear to be from one “virtual” data source. The results are then sent back to the requesting application, which has no idea that multiple data sources were involved. External data As discussed with the Population Runtime patterns, the inclusion of external data into such a “federated” data source model may also be required. In such a case, one option would be for the core “federation node” processing to continue to function on an internal network, accessing external data through the enterprise firewall. This approach is shown in Figure 5-10. Outside world DMZ Data Server/ Services External Data Internal network Application Server/ Services Data Server/ Services Data Integration Internal Data Metadata App N read App 1 Data Integ Tier read/write App 2 Figure 5-10 Federation — Runtime pattern — with external data While this runtime design provides for a clean integration of external data, the performance of federation queries in a model such as this, which contains both internal and external data, may be poor. 
Depending on the speed of links and firewalls in between the federation interface and the external data source, one external data source could slow down the entire query — even if internal data sources responded immediately. As a “federation” request happens in real-time from a requesting application, putting an external resource and network in the middle of this process can effectively affect larger enterprise wide applications. 84 Patterns: Portal Search Custom Design Additionally, as with other “external data” Runtime patterns discussed in this chapter, security of requests to the external data source would also be a concern. All communication channels to the external data source would need to be carefully secured and encrypted. Of course, another alternative would be to introduce the “Population: Synchronization” pattern combined with the Federation pattern, in a manner similar to the Synchronization/Index Population combination shown in Figure 5-7 on page 81. However, the same concerns would exist in this runtime solution as with the Synchronization/Index Population data runtime solution — namely, concerns about the scheduling of data replication updates and security concerns of owners of the external data. 5.5 Information Aggregation Runtime patterns The Information Aggregation business pattern captures the process of taking large volumes of data, text, images, video, and so on, and using tools to extract useful information from them. As discussed earlier in this redbook, Information Aggregation patterns normally leverage the data focused Application Integration patterns to provide a “search” interface into an aggregated set of data. As we have already reviewed the runtime implementation of the Application Integration patterns, it is now time to focus on the Information Aggregation pattern runtime implementations. There are two key applications patterns in this Information Aggregation area that have so far been discussed at the application level: User Information Access (UIA) User Search and Discovery (US&D) However, while these patterns differ at the application level due to their focus on structured data (UIA) versus unstructured text (US&D), their runtime implementations are very similar. Thus, we will focus on the User Search and Discovery pattern, as search is the focus of this redbook — but the runtime discussion of this pattern will apply to the UIA pattern as well. Chapter 5. Runtime patterns 85 5.5.1 User Search and Discovery Runtime pattern The User Search and Discovery pattern can be discussed at a basic level, followed by several more advanced, and more common, variations. To start, the basic runtime model for this pattern is shown in Figure 5-11. W32 App Runtime pattern Browser Web Application Server Search Metadata Application pattern Pres. Search, Discover & Optional Additional Function Data Server/ Services Retrieve & Parse information App N App 2 Figure 5-11 Information Aggregation: User Search and Discovery — Runtime pattern In this Runtime pattern, client users may be either “thick” clients operating via a product API (represented by the Win32 node in this Runtime pattern), or “thin” Web browser based clients interacting via standard Web application models. The presentation/Web application server node handles the user interaction, and passes search requests to the search node. 
This search node then analyzes the search request, and based on its configuration processes the search, and returns the results to the user through the presentation/Web application server node. Search adapters However, especially in a portal implementation, a search technology will often be expected to have “brokering” capabilities. This is the ability to search multiple data indices simultaneously, and provide a seamless consolidated result set with reasonable response time. In most cases, such multi-source brokered search solutions will be implemented via the inclusion of “search adapter” nodes, as depicted in Figure 5-12. 86 Patterns: Portal Search Custom Design Runtime pattern Browser Web Application Server Search Adapter Search Retrieve & Parse information Metadata Application pattern Pres. Data Server / Services Search, Discover & Optional Additional Function App N App 2 Figure 5-12 Iser Search & Discovery — search adapter variation — Runtime pattern The search adapters, then, contain the logic to interface with the actual data indices — either interacting directly with the index at a database level, or interacting with a controlling application’s search interface (normally via a product API). In some cases, multiple search adapters may exist to process queries against a single data index. The idea here is to eliminate any single node in the process from becoming a bottleneck, if too many search requests run against one source. The data indices themselves are regularly updated by the “population” based capabilities discussed in other Runtime patterns earlier in this chapter. The search adapters then return the search results from each individual data index, which the search brokering node must then aggregate and normalize so that the results appear to be from one “virtual” source. The aggregated results are sent back to the presentation/Web application server node, which must format the results for sending back to the client that requested the original search. Search results may be sent back to the presentation node in an already formatted manner (that is, full HTML), or may be passed in a manner that requires the presentation node to transform the data (that is, XML) into the appropriate formats. Similarly, the presentation/Web application server node may then pass the search results to the client in either a fully formatted manner, or may leave the final formatting of the search results up to the client application. Overall, this separation of the search capabilities into multiple nodes in the Runtime pattern, when they were simply defined as a single “data integration” tier in the application pattern view, provides for a level of flexibility in terms of maintenance and performance tuning that would otherwise be unavailable should only a single node provide the required capabilities. Chapter 5. Runtime patterns 87 Search services Another variation of this Runtime pattern to consider is support for the cases in which the “data” being search is not data itself, but rather a “search service” with a defined API for interaction. In such a case, an additional variation would apply to highlight the fact that the search adapters may not directly access a data index, but may interact with an existing search service to actually perform the search. This variation is shown in Figure 5-13. Data Server / Services Runtime pattern Browser Web Application Server Search Search Adapter Metadata Application pattern Pres. 
Search, Discover & Optional Additional Function Search Retrieve & Parse information Data Server / Services App N App 2 Figure 5-13 Iser Search & Discovery — search service variation — Runtime pattern The main difference between this variation, and the search adapter variation, is the inclusion of another search node — representing the external “search service” to be accessed by the search adapter. As far as the user of the application is concerned, they do not know that the search is being performed against a data index directly, or via such a search service. External users and data The User Search and Discovery application pattern is the first pattern we have discussed to the runtime level that includes direct user interaction. Thus, it needs to take into account, not only external and internal datasources and systems as we have considered with other Runtime patterns, but internal and external users as well. In any application that has direct user interaction, the involvement of internal “trusted” users, versus external “untrusted” users, must be considered. The basic runtime model for this pattern already discussed does not truly take this into account. To represent such external users, we must view this Runtime pattern on top of a standard three-tier e-business environment. This is the view shown in Figure 5-14. 88 Patterns: Portal Search Custom Design Outside world DMZ Internal network External User Browser Search Web Application Server Data Server/Services Search Adapter Search Adapter Search Data Server/ Services Search Adapter Search Data Server/ Services Data Server/ Services External Data Figure 5-14 User Information Access — Runtime pattern — external users and data To support external users, a “Web application server” node will normally be placed in the DMZ network layer. This node would service HTTP requests from the Web browser based thin client, passing each search application request (normally JSP or servlets) to the presentation server. Actually, in most cases the Web Application Server node would really be split into two nodes: a Web Server “redirector” node, which would be placed in the DMZ; and a Web Presentation Server node, which would remain on the internal network — thus ensuring that any user interface logic is safe from hacking. Security is of course important whenever an external user is introduced. Depending on the sensitivity of the data included within search results, all communications between the thin client user and Web server must be encrypted. In most e-business Web based applications, this will be done via usage of the SSL encryption, resulting in encrypted HTTP traffic (HTTPS). However, the introduction of any additional encryption to an existing system can result in a dramatic performance impact. Therefore, the ability of any system to support the additional encryption loads should always be considered. Tip: Of course, to alleviate the impacts of such encryption, special “SSL appliances” can be utilized to off-load the encryption/decryption processing from the rest of the system. Typically, thick client users would only be supported internally, as it can be problematic to properly secure external API based requests. Chapter 5. Runtime patterns 89 External data There will also be situations in which external and internal data indices will be included in an “extended search” implementation of this Runtime pattern. 
In such situations, the various search nodes will continue to run within the internal network, and a search connector will be placed within the external network close to the external data itself. Alternatively, the search connector node may be placed in the DMZ layer to limit any compromise of the search connector logic. This support for external data is also shown (along with the external users) in Figure 5-14 on page 89. Similar to the impacts of external data in the “federation” Runtime patterns already discussed, the inclusion of any external data sources within an “external brokered search” implementation of the User Search and Discovery pattern can have unexpected performance results. Depending on the speed of links and firewalls in between the search broker and the external data index, this one external data query could slow down the entire search query — even if internal data searches responded immediately. However, this may have less of a “domino affect” than in the case of the Federation Runtime pattern, as only a single user might experience this delay — versus the application “customers” of pure data federation. Ultimately, users may consider a small performance hit acceptable in exchange for a single search result that integrates both internal and external data. 5.5.2 Information Aggregation in business intelligence solutions The runtime designs that have been presented so far for each of the Information Aggregation patterns have been focused specifically on general search related business problems. Typically, such search capabilities will be enabled in a knowledge management type of solution, with the goal of helping users make sense of large volumes of unstructured content. However, these patterns can also apply to the more structured data analysis problems common in business intelligence (BI) types of solutions. For more understanding in how these patterns would function at the runtime level for these solutions, please reference the IBM patterns for e-business Web site at: http://www-106.ibm.com/developerworks/patterns/bi/at5-runtime.html 90 Patterns: Portal Search Custom Design 5.6 Combining the Runtime patterns At this point we have presented a high level view of the overall Portal Search custom design Runtime pattern in Figure 5-2 on page 74. We have also presented detailed views to the various Application Integration::Population, Application Integration::Federation, and Information Aggregation::User Search and Discovery Runtime patterns. However, as discussed in 4.4, “Combining the patterns for search solutions” on page 63, all of these search related patterns must really be combined to result in a single comprehensive e-business search solution. Thus, it is helpful to view all three of these patterns in a single consolidated yet detailed Runtime pattern to clearly understand the interactions of these patterns. Figure 5-15 depicts one potential solution shown at the Runtime pattern level combining several of these patterns. Federation Data Server/ Services Population Data Server/ Services Replicated External Data External Data Population Browser Web Application Server Search Adapter Search User Search and Discovery Data Server/ Services Data Integration Search Adapter Internal Data Data Server/ Services Population Population Figure 5-15 “Search” Runtime patterns combined — federation, population, and user search Chapter 5. 
Runtime patterns 91 As can be seen in this combined Runtime pattern, the combination of all of these application patterns is what truly creates a complete search solution. In this composition of patterns, we have external data being replicated internally, then federated with existing internal data to appear as one search source. Then we have another search source being generated by a instance of the Population: Index Population pattern. All three of these data sources are ultimately being searched by external users via one single query. The user is oblivious to the location and virtualization of the data. When taking the knowledge gained from this “combined” search Runtime pattern, and adding its concepts to the overall Portal composite Runtime pattern, we ultimately end up with the overall Portal Search custom design Runtime pattern presented in Figure 5-2 on page 74. The “search”, “federation”, and “population nodes” in this overall custom design Runtime pattern represent the three key search Runtime patterns we have discussed in this chapter. 5.7 Summary In this chapter, we have stepped down to the “runtime” level of search solutions. We have discussed the common runtime nodes typically used, and the variations of these nodes, depending on the requirements for performance and security, and integration with external data sources. In the next chapter we will take these Runtime patterns and map them down to the product level. That is, we will highlight the specific IBM technologies and products that can be utilized to implement each runtime node in the real world. 92 Patterns: Portal Search Custom Design 6 Chapter 6. Portal Search product mappings After choosing the appropriate Runtime pattern, it is time to map the Runtime pattern and the products used to implement it. A product mapping maps the logical nodes defined in the Runtime pattern to specific products that implement the Runtime solution design on a selected platform. The product mapping identifies the platform, software product name, and often version numbers as well. © Copyright IBM Corp. 2004. All rights reserved. 93 6.1 Mapping the Runtime pattern In order to expedite the process of implementing any pattern, existing products can be chosen that already contain the necessary functionality. In many cases, additional customization of these products will be necessary to meet the business drivers. The goal is to choose a set of products and technologies that minimize the need for customization. Thus, the next step after choosing a Runtime pattern is to determine the actual products, technologies, and platforms that form a best fit for the desired solution. In addition to the business drivers, consider these principles when determining a product and technology mix: Existing systems and platform investments Customer and developer skills available Customer choice Future functional enhancement direction The products and technologies chosen should fit into the target environment and ensure quality of service, such as scalability and reliability, so that the solution can grow along with the e-business. 6.1.1 Functional mappings The Portal Search custom design Runtime pattern, based on the Portal composite Runtime pattern, is constructed to be product and technology agnostic. Figure 6-1 contains those “functions” that various nodes will provide, and these functions can be mapped to specific products, a group of products, or multiple products providing functionality to more than one node. 
Note: Refer to 5.1, “Runtime node descriptions” on page 68 for additional information regarding the various nodes identified as part of the Portal Search custom design, and a detailed description for each node as needed. 94 Patterns: Portal Search Custom Design DMZ Outside world Internal network Search/Query engine Search Brokering Indexing Categorization Summarization Protocol filtering Circuit level gateway support Proxy functionality Population Domain Name Server Pervasive Device Web Server Redirector Domain Firewall Browser User Data federation / virtualization unified query interface Directory and Security Services Federation Personalization Server I N T E R N E T Protocol Firewall Public Key Infrastructure Database supporting LDAP User Records Group Data Organizational Data Servlet Redirector Database Server Search A rules engine mapping portal "resources" to user "groups" Presentation Server Application Server Wireless Gateway Alternative protocol gateway (WAP, UDP, etc.) RDMBS or Structured file system data source Content Transcoding Access Control List Content Aggregation Collaboration Content Management Synchronous interaction (Instant Messaging) Asynchronous interaction (content management) Data source connectivity (remote & local) Business logic Transaction management Content transcoding Content Publishing Content Creation Content Editing Content Approval Workflow Content Transcoding Presentation Templates Versioning Figure 6-1 Portal search custom design Runtime pattern::Functional mappings 6.1.2 Product mappings Once the Runtime pattern has been chosen and functions have been identified, a set of products and technologies must be applied so that detailed design and implementation can occur. As this is an IBM Redbook, we will focus on the IBM products and technologies that map to these runtime nodes. However, technologies from other vendors may also apply, should you have existing technologies in your environment that you wish to leverage for some of these capabilities. There are actually multiple IBM products that have the correct balance of scalability, maintainability, and extensibility to support this Runtime pattern. Figure 6-2 provides a list of the IBM products that can be applied to the different runtime nodes in this custom design, depending on a specific solution’s needs. Chapter 6. 
Portal Search product mappings 95 DMZ Outside world Internal network Lotus Discovery Server WebSphere Portal (Juru) Search Engine Lotus Domino Lotus Extended Search DB2 Information Integrator for Content (EIP) Lotus Extended Search AND/OR DB2 Information Integrator for Content (II4C, formerly known as EIP) IBM SecureWay Firewall Population Domain Name Server Pervasive Device IBM WebSphere Everyplace Suite DB2 Information Integrator and Information Integrator for Content Directory and Security Services Federation Personalization Server Web Server Redirector Domain Firewall Browser User I N T E R N E T Protocol Firewall Public Key Infrastructure IBM SecureWay Directory Server OR Lotus Domino with LDAP service AND/OR IBM Policy Director (WebSEAL) Database Server Search IBM WebSphere Personalization Server Presentation Server Application Server Wireless Gateway IBM HTTPD Server IBM DB2 Universal Database IBM WebSphere Portal Server AND (for search presentation) Lotus Extended Search OR DB2 Information Integrator for Content Collaboration Content Management Lotus Sametime AND/OR Lotus Quickplace AND/OR Lotus Domino IBM WebSphere Application Server Advanced Edition OR IBM WebSphere Application Server Enterprise Edition IBM Web Content Publisher OR IBM Content Manager OR Lotus Domino.DOC Figure 6-2 Portal Search custom design Runtime pattern::Product mappings Note: A specific operating system for each node is not listed in the product mappings, as there are generally several options, because IBM’s products run on a multitude of platforms (for example, Windows 2000, Linux, AIX®, etc.). For the scenario implemented in this redbook, Microsoft® Windows 2000 Server with Service Pack 3 was utilized for all servers. More details on the runtime product mappings chosen for this book’s technical scenario can be found in 10.1, “The runtime environment” on page 168. 96 Patterns: Portal Search Custom Design 6.1.3 Network protocol mappings Finally, as shown in Figure 6-3, the network protocols used for a typical installation of this Runtime pattern are as follows: HTTP/HTTPS: Hypertext Transfer Protocol (HTTP), or Hypertext Transfer Protocol Secure (HTTPS), is used from the user’s Web browser to the HTTP server in the Web server redirector node. HTTP, or HTTPS, is also used from the WebSphere Web server plug-in the Web server redirector node to the Web container in the Presentation server node as well as from the collaboration and content management node to the Presentation server node. Finally, HTTP or HTTPs may be used between the Search node and the Presentation server node. LDAP/LDAPS: The presentation and application server uses Lightweight Directory Access protocol (LDAP) to access the LDAP Server in the Directory and Security Services node. LDAPS is the secure LDAP connection to a directory server using SSL. Since LDAP directories store essential and sensitive applications, as well as business information; the communication can use LDAPS to be secure. JDBC: The application server node and the Directory and Security Services node uses a Java Database Connectivity (JDBC) driver to access the database server node. The Search, Population, and Federation nodes may also use JDBC to communicate with their applicable data sources. The Search node will communication either directly, or through the Federation node. 
RMI/IIOP: The personalization server node uses Remote Method Invocation (RMI) over Internet Inter-Orb Protocol (IIOP) to access the EJB container in the presentation server node and the EJB container in the application server node. RMI/IIOP is also used from the presentation server node to the EJB container in the application server node. Additionally, RMI may be the method of communication to the Search functionality available in the Search node as well. Note: Two application servers can also communicate via HTTP with SOAP using the Web Services technology. Figure 6-3 Portal Search custom design Runtime pattern — Protocol mapping 6.2 Product descriptions This section provides some background and details on each IBM product identified in the product mappings for this Portal Search custom design. This information is required to more fully understand these product mappings. 6.2.1 Lotus Extended Search IBM Lotus Extended Search is a scalable, server-based technology that searches in parallel across many content and data sources, returning integrated query results into a Web application. It provides the following capabilities: Single search: Find relevant information from multiple sources with a single search using only a Web browser. Parallel searching: Search in parallel across structured and unstructured data stores, including popular Web search sites, Lotus sources, RDBMS, Index and Directory sources, Content Management applications, Sametime® users, and more. Single result set: Get aggregated results presented as a single, ranked result set. Integrate with e-business applications: Easily integrate search capability into e-business applications via a strong Java based API and SOAP/Web Services interface. Scalable search: Support scalable enterprise search requirements across departmental and geographic locations. Save, reuse, share searches: Save, re-use, and share searches. Store and forward search results: Store search results and/or forward search results to workflow or personalization applications. Identify people: View shared searches to identify people with similar interests. Extended Search — key components There are really three key “components” of the Extended Search product: clients, brokers, and links. Clients: Extended Search includes several ready-to-use, customizable search applications — including “portlets” for usage in WebSphere Portal. However, it also provides a solid API to allow for custom client development. The Extended Search common API is available in three forms: You can embed data beans and Extended Search Java Knowledge Management (JKM) tags in HTML pages. If you use a Lotus Domino™ Web application server, you must use this approach. If you use IBM WebSphere Application Server, you can choose to use this approach. You can create JavaServer Pages (JSP) and use Extended Search JSP beans or JSP tags to embed search functionality. This approach is supported by WebSphere Application Server.
You can use the Extended Search Simple Object Access Protocol (SOAP) interface, which enables you to provide search functionality as a Web service. Brokers: Extended search brokers manage and synchronize requests and responses between multiple clients and back-end systems. They: Distribute queries for efficient, parallel searching Aggregate and filter search results Enable peer-to-peer communication to search across departmental, corporate, and geographic boundaries Chapter 6. Portal Search product mappings 99 Links: Extended Search links are the software modules that encapsulate the native API calls for search and retrieval to a specific data management system. They contain all of the required data structures, programming objects, and procedural logic necessary to interface with the back-end data system: They can connect brokers to targeted data stores, including: IBM Lotus Notes®, IBM DB, Oracle, Sybase, Microsoft SQL Server, Microsoft Access, IBM Lotus Discovery Server™, IBM Lotus Instant Messaging (Sametime), IBM Lotus Team Workplace (Quickplace), IBM Lotus Domino.Doc®, IBM WebSphere Portal Search Engine, Domino Domain Index, IBM Secureway, Microsoft Index Server, Microsoft Site Server, Microsoft Exchange,LDAP Server directories, file systems, and over 18 Web search sites (such as Hotbot, Excite, Alta Vista, News and User groups, etc). They can translate queries into the native search languages of the target data stores. They can be created to access additional data stores. More details on Lotus Extended Search can be found within the various documents and whitepapers available at: http://www-10.lotus.com/ldd/notesua.nsf/find/les Extended Search is also implemented within the technical scenario detailed in later this redbook, and the product architecture is described in more detail within Appendix B, “Understanding the Lotus Extended Search architecture” on page 207. 6.2.2 DB2 Information Integrator The DB2® Information Integrator™ product sets are designed to address customer requirements for integrating structured, semistructured and unstructured information effectively and efficiently. This product is broken into both structured data, and unstructured text capabilities: DB2 Information Integrator V8.1 IBM DB2® Information Integrator V8.1 provides integrated, real-time access to diverse data as if it were a single database, regardless of where it resides. 
The federated server capabilities of this product allow users to: Create an abstract relational view across diverse data Use existing reporting and development tools Rely on leading-edge cost-based optimization 100 Patterns: Portal Search Custom Design The replication server capabilities of this product allow users to: Manage data movement strategies, including distribution and consolidation models Monitor synchronization processes This product supports multiple data sources, including: DB2 Universal Database™ Informix® MS SQL Server Oracle, Sybase, Teradata, ODBC and others More details on the DB2 Information Integrator (for data) can be found within the various documents and whitepapers available at: http://www-3.ibm.com/software/data/integration/db2ii/support.html DB2 Information Integrator for Content (EIP) IBM® DB2® Information Integrator for Content (formerly Enterprise Information Portal in versions 8.1 and earlier) provides broad information integration and access to: Unstructured digital content such as text, XML and HTML files, document images, computer output, audio and video Structured enterprise information via connectors to relational databases Lotus Notes® Domino databases and popular Web search engines (via IBM Lotus Extended Search) This product also supports Information Mining and Web Crawling, such that it provides an API for the automation of information extraction and analysis, and provides for intranet, extranet, or Internet Web crawling. More details on the DB2 Information Integrator for content can be found within the various documents and whitepapers available at: http://www-3.ibm.com/software/data/Information Integrator for Content/library.html 6.2.3 Lotus Domino Lotus Domino provides a multiplatform foundation for collaboration and e-business, driving solutions from corporate messaging to Web based transactions — and everything in between. This enterprise-class messaging and collaboration system is built to maximize human productivity by unleashing the experience and expertise of individuals, teams, and extended communities. Chapter 6. Portal Search product mappings 101 In terms of portal search solutions, Lotus Domino provides a basic search and indexing engine. Administrators can create full-text indexes to allow users to quickly search for information in databases. To search in a database, users enter a word or phrase in the search bar of the database to locate all documents containing the word or phrase. Additionally, Domino also supports a capability called “Domain Search”, which supports searching across and entire Domino domain of databases and files. To support Domain Search, you need to designate a Domino server as the indexing server, which builds a domain wide index that all Domain Search queries run against. In order for the indexing server to build the index, you must first create a Domain Catalog on the server — a database that controls which databases and file systems get indexed. The indexing server then spiders, or crawls, the servers that contain the content to be indexed. When a user submits a query, the results that the indexing server returns contain only database documents to which that user has appropriate access. If the indexing server is set up as a Domino Web server, it can support searches from both Lotus Notes and Web browsers. 
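As a simple illustration of programmatic access to the Domino search engine, the following sketch queries an existing full-text index from a standalone Java program using the Domino back-end classes (lotus.domino). The database path (discussion.nsf) and the Subject item name are illustrative assumptions, and the sketch presumes a local Notes/Domino runtime and a database that has already been full-text indexed.

import lotus.domino.*;

// Minimal sketch: querying an existing Domino full-text index from Java.
// The database path and the "Subject" item name are illustrative assumptions.
public class DominoSearchSketch {
    public static void main(String[] args) {
        try {
            NotesThread.sinitThread();                      // initialize the local Notes runtime
            Session session = NotesFactory.createSession();
            Database db = session.getDatabase("", "discussion.nsf");
            if (!db.isFTIndexed()) {
                System.out.println("This database has no full-text index.");
            } else {
                // return at most 25 documents matching the query
                DocumentCollection results = db.FTSearch("portal AND search", 25);
                System.out.println("Hits: " + results.getCount());
                Document doc = results.getFirstDocument();
                while (doc != null) {
                    System.out.println(doc.getItemValueString("Subject"));
                    Document next = results.getNextDocument(doc);
                    doc.recycle();                          // release the back-end object
                    doc = next;
                }
            }
            session.recycle();
        } catch (NotesException e) {
            e.printStackTrace();
        } finally {
            NotesThread.stermThread();                      // terminate the local Notes runtime
        }
    }
}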
More information on Lotus Domino, and the Domino Domain search, can be found within the Lotus Domino product documents found at: http://www.lotus.com/ldd/doc 6.2.4 Lotus Discovery Server The IBM Lotus Discovery Server is a knowledge server that provides advanced search and expertise location solutions designed to ensure that all of the relevant knowledge and collective experiences of an organization are readily available to help individuals and teams solve every day business problems. To do this, the Lotus Discovery Server extracts, analyzes and categorizes structured and unstructured information to reveal the relationships between the content, people, topics and user activity in an organization. It will automatically generate and maintain a Knowledge Map (K-map) to display relevant content categories and their appropriate hierarchical mapping that can easily be searched or browsed by users. The server also generates and maintains user profiles and tracks relevant end-user activities, identifying those individuals who may be subject matter experts. More details on Lotus Discovery Server can be found at: http://www-3.ibm.com/software/lotus/knowledge/ http://www.lotus.com/products/discserver.nsf 102 Patterns: Portal Search Custom Design More information can also be found within the IBM Redbook Lotus Discovery Server 2.0 Deployment, Planning, and Integration, SG246575: http://www.redbooks.ibm.com/abstracts/sg246575.html 6.2.5 WebSphere Application Server IBM WebSphere Application Server provides Web and application server services in an e-business environment. It supports custom-built applications, based on integrated WebSphere software platform products, or on other third-party products. Such applications can range from dynamic Web presentations to sophisticated transaction processing systems. WebSphere Application Server is leading the way in support for industry open standards. WebSphere Application Server provides full Java 2 Platform, Enterprise Edition (J2EE) compliance with a rich set of enterprise Java open-standards implementations. It also provides built-in support for the key Web services open standards, making it production-ready for the deployment of enterprise Web services solutions. The latest version of WebSphere Application Server is version 5. More details on WebSphere Application Server can be found at: http://www.ibm.com/websphere More information can also be found within the IBM Redbook: WebSphere Application Server V5.0 System Management and Configuration, SG24-6195: http://www.redbooks.ibm.com/abstracts/sg246195.html 6.2.6 WebSphere Portal WebSphere Portal is IBM’s comprehensive portal offerings for successful business-to-employee (B2E), business-to-business (B2B) and business-to-consumer (B2C) portals. WebSphere Portal: Delivers a single, point of personalized interaction with applications, content, processes, and people for a unified user experience Allows users to view, search, create, convert, and edit basic documents, spreadsheets, and presentations from within the portal Provides powerful collaboration capabilities such as instant messaging, team workplaces, people finder and e-meetings Enables quick portal integration with back-end systems via portlet builders Chapter 6. Portal Search product mappings 103 The latest version of WebSphere Portal is Version 5.0. WebSphere Portal 5.0 for Multiplatforms. 
It includes two offerings: Portal Enable is the base offering, and provides personalization, content publishing, document management, and productivity functions along with the scalable portal framework. Portal Extend adds powerful collaborative, extended search (via Lotus Extended Search), and Web analysis features to enhance portal effectiveness. There are also small business focused versions of WebSphere Portal, and versions for the IBM eServer zSeries and iSeries platforms: WebSphere Portal — Express WebSphere Portal Enable for iSeries™ WebSphere Portal for z/OS® and OS/390® More details on WebSphere Portal can be found at: http://www7b.software.ibm.com/wsdd/zones/portal/ 6.2.7 WebSphere Portal Search Engine (Juru) WebSphere Portal Version 5.0 provides a Portal Search Engine to facilitate indexing and searching of information. Starting with WebSphere Portal version 4, it has included a built-in search engine that crawls and indexes Internet and text documents. This consists of a search portlet that provides fast and precise free-text search (with ranking by relevance or by date at the user's request) as well as an “admin” portlet for easy configuration. In addition, it provides a SOAP interface for Web services enablement. The search technology included in WebSphere Portal is based on a search technology originally developed by IBM Research, and code-named Juru. Juru is a full-text search library entirely written in Java that focuses on highly precise search results. It efficiently applies state-of-the-art search algorithms as well as unique techniques to produce both effective and efficient results. The use of Java and Internet technology (servlets, templates, SOAP, etc.) allows easy integration with cross-platform applications. It also enables developers to incorporate new document types and to easily develop new user interfaces. Juru's basic search features include: free-text query specification, advanced query operators, multi-lingual support, summarization, customized thesaurus (for example, synonyms) and stop list, search results clustering, and index compression. Some of Juru's unique features include: multiple word indexing (Lexical Affinities) for disambiguation and high precision, query assistance (word/query completion), utilization of link information for Intranet search, and lossy index compression to help choose the ideal size of your index. The Juru-based WebSphere Portal Search Engine can scale up to 100 GB worth of index data on a single server. As the technical scenario in this redbook is based on WebSphere Portal v4, this technology was not utilized — due to incompatibilities with some of the other technologies that were implemented (that is, Lotus Extended Search did not support WebSphere Portal Search Engine until Lotus Extended Search 4.0). However, this technology should be considered a key component of any WebSphere Portal version 5.0 based portal search implementation.
The specific product mappings we will examine are: Population: Index Population — as this maps to our overall “indexing” node Federation — as this maps to our overall “federation” node User Search and Discovery — as this maps to our overall “searching” node Population choices Figure 6-4 shows the Population Runtime pattern. Figure 6-4 Product mappings for Index Population Runtime pattern As depicted, the Population: Index Population Runtime pattern has multiple product mapping choices, as multiple IBM products perform basic index creation capabilities. Ultimately, the product you choose will depend on the environment into which you are going to deploy the solution, and the types of data you are trying to index: Lotus Extended Search ships with a basic Web crawler that can be used for small, basic Web indexing needs. DB2 Information Integrator for Content has a Web crawler API, which allows you to develop your own custom Web crawler. WebSphere Portal has a high quality search engine based on the IBM Juru technology. It would be most appropriately used in WebSphere Portal environments. Lotus Domino has always had a robust data indexing engine, which easily handles the creation of indices of Domino/Notes® data. Lotus Discovery Server provides built-in “spiders” to crawl and perform advanced indexing and categorization of multiple types of sources — as well as an API for building custom “spiders”. It is most often used when advanced taxonomy generation and expertise location capabilities are needed. Federation choices Figure 6-5 shows the Federation Runtime pattern. Figure 6-5 Product mappings for Federation Runtime pattern As depicted, the Federation Runtime pattern also has multiple product mapping choices — although fewer choices than with Population. The IBM products that provide the Data Integration capabilities are DB2 Information Integrator and DB2 Information Integrator for Content (II4C). We will focus on the “content” version, as the focus of this book is searching of unstructured content. When II4C is used, IBM Content Manager and/or DB2 or other relational databases serve as the back-end data service. The front-end calling application would then be any application written to the II4C API — be it a native C based application or a Java based application running in WebSphere. Additionally, Lotus Extended Search provides a connector to allow it to leverage a federated data source through Information Integrator within one of its queries. Search choices When determining which technologies to map to the search node, we need to first decide which variation of the User Search and Discovery application pattern to utilize. This can be either a basic single data source search, or a more “extended/federated” search across multiple data indices. Once this is decided, then the product mappings shown in Figure 6-6 apply.
Figure 6-6 Product mappings for basic User Search and Discovery In the base User Search and Discovery pattern, the choices are really defined by the user interface. That is, if the user is in Domino, then traditionally the Domino search capabilities will be utilized against a back-end NSF data store. Alternatively, if the user is in a WebSphere application, or WebSphere Portal, then the WebSphere Portal Search Engine will be used, with the index previously built by the Portal Search Engine (PSE) in various files on the file system. Product mappings for User Search and Discovery with search adapters are shown in Figure 6-7. Figure 6-7 Product mappings for User Search and Discovery w/Search Adapters However, when a more extended search version of the User Search and Discovery pattern is used (with search adapters), there are really two key technology choices. The selection of one technology over the other is usually made based on the data to be searched: Lotus Extended Search (LES): This is most often used for search across more typically “unmanaged” sources of “collaboration” data, such as Lotus Domino, Domino.Doc, MS Index Server, MS Sharepoint Portal, Quickplace 3.0, Lotus Discovery Server, WebSphere Portal Search Engine, MS Site Server, MS Exchange, MS SQL Server, MS Access, public Web search sites and syndicated content providers, and ODBC data. IBM Information Integrator for Content (II4C): This is most often used when searching is needed across traditional IBM “managed content” sources, such as IBM DB2 Content Manager, Content Manager OnDemand, ImagePlus®, EDMSuite™ VisualInfo™, and Lotus Domino.doc. To make matters more confusing though, both LES and II4C provide connectors to leverage each other. Thus, one can use LES as the search “broker”, including more traditionally “managed content” sources in its search via a connector to II4C. Alternatively, II4C can connect to LES to include more unmanaged file and collaborative data. In cases where the type of data to be searched does not clearly define the technology to select, other aspects, such as the varying APIs for each option, may come into play. For example, LES comes with more ready to use “out of the box” interfaces, while the II4C toolkit/API must really be leveraged to build a custom interface for usage with II4C. One other final consideration is the ability to update data from the search results. II4C allows for editing and manipulation of resulting documents within the back-end managed content sources. Note: In this section we have only discussed the key decision criteria in choosing the right product mapping for the search-specific nodes. For more details on choosing the right product for the portal-specific nodes of the Portal Search custom design, please reference the IBM Redbook Patterns: A Portal composite pattern Using WebSphere Portal V4.1.2, SG24-6869.
6.4 Summary This chapter has taken our discussion of the Portal Search custom design down to its final level. The high level business and integration patterns we originally discussed when first introducing the need for portal search, have now been taken down to specific product mapping recommendations. In the rest of this redbook, we will now discuss technical guidelines for using and implementing these technologies. Chapter 6. Portal Search product mappings 109 110 Patterns: Portal Search Custom Design Part 3 Part 3 Solution guidelines © Copyright IBM Corp. 2004. All rights reserved. 111 112 Patterns: Portal Search Custom Design 7 Chapter 7. Technology considerations When selecting any search engine or technology, there are many important aspects to consider beyond just whether it can support the data sources that you require it to search. This chapter attempts to define the key questions and technology aspects that must be considered when determining the technical implementation of a search solution. © Copyright IBM Corp. 2004. All rights reserved. 113 7.1 Query syntax support Nearly all data management systems employ a grammar or query language of some kind to express the criteria of a search. These grammars can vary widely depending on the structure and composition of the data. In free text systems such as the Web, for example, the search is generally expressed as a list of keywords. Additional notations are used to express boolean conditions (and, or, not) or positional information, such as specific words that must occur within the same sentence or paragraph. However, if the data is highly codified and structured, the grammar may be more parametric and may support fielded operations (for example, the value of the Quantity field is greater than 100). So, when a search technology is being investigated, the potential power of its search language should be carefully considered. Does it support boolean searches, field level searches, exact phrase searches, etc? There are currently efforts underway in the various standards bodies to investigate standards for a common search query syntax, similar to the SQL standard utilized in the relational database world. Query syntax in an extended search world When an “extended” search solution is utilized, it is clearly impractical for a user to know the syntax used by each brokered search source. It is instead much more practical to let the user express a query in a single common search language, which is then in turn mapped to the specific native query syntax utilized by each of the brokered search sources. As an example, the Lotus Extended Search products offers a “common” language, which it refers to as the Generalized Query Language (GQL).This GQL is basically a superset of search grammars from which most queries can be expressed. Even with such a common syntax solution, there will be occasions in which a specific query “expression” will simply not map to a back-end search source, as that back-end search source does not have a matching search capability. For example, a field level search may not be supported by all search sources. Alternatively, a single search source may support more advanced search syntax expressions than are supported by the extended searches common query language. In these cases, the ability for the extended search technology to support the “passthrough” of a query is important, such that it skips the translation step and runs “as is” against the target data/search source. 
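To make the idea of a common query syntax more concrete, the following Java sketch translates a small keyword query into two hypothetical native forms — a SQL statement and a free-text expression — and shows a passthrough case that skips translation entirely. The syntax, column names, and translation rules here are invented for illustration; they are not the Extended Search GQL or its actual translator definitions.

// Illustrative only: translating a simple common query into source-native syntax,
// with a "passthrough" escape for queries that should run untranslated.
// The DOCS table, BODY_TEXT column, and "+word" free-text form are invented for this sketch.
public class QueryTranslatorSketch {

    // Translate required keywords into a SQL statement for a relational source.
    static String toSql(String[] keywords) {
        StringBuilder sql = new StringBuilder("SELECT DOC_ID, TITLE FROM DOCS WHERE ");
        for (int i = 0; i < keywords.length; i++) {
            if (i > 0) sql.append(" AND ");
            // naive escaping; a real translator must handle quoting and wildcards properly
            sql.append("BODY_TEXT LIKE '%").append(keywords[i].replace("'", "''")).append("%'");
        }
        return sql.toString();
    }

    // Translate the same keywords into a free-text style expression.
    static String toFreeText(String[] keywords) {
        StringBuilder query = new StringBuilder();
        for (String word : keywords) {
            if (query.length() > 0) query.append(' ');
            query.append('+').append(word);
        }
        return query.toString();                    // e.g. "+portal +search"
    }

    public static void main(String[] args) {
        String[] keywords = { "portal", "search" };
        System.out.println(toSql(keywords));
        System.out.println(toFreeText(keywords));
        // Passthrough: a query flagged as native skips translation entirely.
        System.out.println("passthrough: Quantity > 100");
    }
}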
114 Patterns: Portal Search Custom Design Additionally, there may also be occasions in which the “translation” utilized by extended search product to transform the common query into the specific data sources query syntax may not be sufficient. One may need to add and revise the defined translation rules to include enterprise-specific rules for customized translator programs (for example, to access enterprise-proprietary databases). Thus, the capability to modify these rules is an important feature to consider. In the Lotus Extended Search product, the specification of each supported grammar is recorded in a grammar definition in the products configuration database. Grammar definitions contain an entry that associates a shared library with a particular Extended Search grammar. The specified library contains the actual code for translating a GQL statement into grammar native to the link type, such as rules for translating GQL to SQL — and these definitions can be modified or created from scratch as required. 7.2 Support for a common data model Just as search grammars can vary with each dissimilar back-end system, so can the data models used to organize and store information. The data model used by a particular data management system is typically designed for the class of applications it serves. This determines the amount of structure and granularity found in its information. For example, free text systems tend to use a loosely structured document model with low data granularity. A document may consist of a few fields (such as title, author, and body) but its text remains free in form and unstructured. By comparison, information can be highly structured, such as that found in relational databases. Here, data is organized into rows and columns that can be related in any number of ways, which results in high data granularity. A search solution that provides for extended searching across multiple data sources must consider these diverse data models, and determine how to normalize these models into a single common data model so that search results can easily be aggregated. A common model should not attempt to achieve a full union of all the back-end data models but rather provide a flexible form into which all models can map most of their concepts. One important feature enabled by such a common data model is field level searches via field mapping. Field mapping A common problem encountered when relating data sources of different types is the mismatch in field labels. For example, an author’s name might be labeled AUTH_NAME in one data source and CREATOR in another, and yet be represented as three fields (such first name, middle initial, and last name) in another. Chapter 7. Technology considerations 115 An important feature of a common data model is the ability to define mapped fields. A mapped field is a composite of one or more native fields. To resolve the ambiguity in our author name example, you could define a single mapped field with the label AUTHOR. You could then map this field to one or more native fields in each of the data sources that support the semantic of author’s name. The benefits of this mapped field feature in an extended search technology are compelling when used in a search expression. A user needs only to specify the mapped field in the query, and the search server will automatically associate the mapped field to the correct native fields on the back-end. 
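A minimal sketch of this mapped-field resolution, continuing the AUTHOR example, might look as follows. The source names, native field names, and query form are invented for illustration; a real product would hold these mappings in its configuration store rather than in code.

import java.util.*;

// Illustrative sketch: resolving a "mapped field" to the native field names of each
// back-end source, and expanding a fielded search term accordingly.
public class FieldMappingSketch {

    // common field -> (source -> native field names)
    static final Map<String, Map<String, List<String>>> MAPPINGS = new HashMap<>();
    static {
        Map<String, List<String>> author = new HashMap<>();
        author.put("HR_DB", Arrays.asList("FIRST_NAME", "MIDDLE_INIT", "LAST_NAME"));
        author.put("DOC_LIBRARY", Arrays.asList("AUTH_NAME"));
        author.put("MAIL_ARCHIVE", Arrays.asList("From"));
        MAPPINGS.put("AUTHOR", author);
    }

    // Expand a fielded term (e.g. AUTHOR=smith) into one native term per
    // underlying field, OR'ed together, for the given source.
    static String expand(String commonField, String value, String source) {
        List<String> nativeFields = MAPPINGS.get(commonField).get(source);
        StringBuilder expanded = new StringBuilder();
        for (int i = 0; i < nativeFields.size(); i++) {
            if (i > 0) expanded.append(" OR ");
            expanded.append(nativeFields.get(i)).append(" = '").append(value).append("'");
        }
        return "(" + expanded + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("AUTHOR", "smith", "HR_DB"));
        System.out.println(expand("AUTHOR", "smith", "DOC_LIBRARY"));
    }
}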
This approach greatly simplifies the query expression, and provides greater benefit as the number of data sources increases. Not only do mapped fields help in simplifying the search expression, they can also be used to simplify the processing of search results by the extended search engine. For example, if the result came from a personnel record in an LDAP directory, you might want to return the person’s name, job title, and contact information in the search results list. On the other hand, if the result came from an e-mail system, you might want to return the date, subject, and author. These different pieces of document metadata could both be mapped to the same common fields, and be requested from the document index with the search results. 7.3 Simple versus advanced index creation As discussed in the application patterns discussion earlier in this redbook, a search product that builds search indices can function at multiple levels. It can perform a basic index creation, which allows for a simple text based search of source data, or it can provide more advanced capabilities such as summarization and categorization. It is becoming increasingly common these days for search technologies to provide more than just index creation, so this is another key criteria in selection a search technology. Summarization Summarization techniques provide for a short descriptive text “summary” of a document within the resulting index. This allows search results to present a small summary of text to the user, allowing them to better decide if a given search results is applicable to them or not, without having to click through and view the actual document. Basic summarization techniques can extract document summaries that are already included and marked as such within a document. Obviously, it is easier to attach a summary to index entries if selected documents contain a summary that is clearly marked. This is often the case with HTML based documents that may 116 Patterns: Portal Search Custom Design contain summaries in the HTML headers. However, when a pre-written summary does not exist, one can also be extracted via other electronic means. Electronic extraction of document summaries can also occur in a simple fashion, such as using the first XX characters or sentences in a document as the summary. In more advanced cases, summarization processes can intelligently analyze the document to determine important sentences that should be included within the summary. The importance of a sentence is determined by some surface clues such as the number of important keywords, the type of sentence (fact, conjecture, opinion, etc.), rhetorical relations in the context, and the location in which a sentence exists in a document. Such advanced summarization extraction capabilities can also be taken a step further in that a summarization method might be changed by text types. For example, one would use different strategies of summarization for ordinary articles versus editorial articles in newspapers. This corresponds to a change in the weight (or the importance) of each surface feature when calculating sentence importance. Categorization When document categorization capabilities are involved, documents are sorted into a list of categories that group like documents together. Such categories can allow users to browse to documents that might meet there interest, rather than just entering search criteria. Such categories can be grouped into a hierarchical structure, forming an overall “taxonomy” for all indexed documents. 
Like summarization techniques, categorization techniques can also vary widely. At the basic level are techniques that provide a simple rules based categorization. An administrator defines the categories, and the rules that documents must map to fit into a given category. For example, all documents with the words “Redbook” and “IBM” and IBM occurring often in the first part of a document would map to the “IBM Redbooks” category. More advanced categorization and taxonomy generation technologies also exist. These technologies will attempt to extract key “features” of documents, cluster documents with common “features” together, and then determine the appropriate category names for these clusters of documents. The Lotus Discovery Server is one example of such technology, although the Discovery Server product takes this a step further by including capabilities to match the authors of documents to categories as well, providing for an “expertise location” type of capability. In real life usage, many of these “automatic” taxonomy generating products still need to be verified and cleaned up by a document specialist that is familiar with the subject matter. Thus, for such “automatic” taxonomy and category generating products, strong taxonomy “editing” tools are an important consideration to allow the resulting taxonomies to be edited and modified as needed. Chapter 7. Technology considerations 117 Multi-language support When selecting a product due to its more advanced summarization and categorization capabilities, do not forget to first consider whether any of the data sources to be searched involve multiple languages. The more advanced “text mining” features utilized by summarization and categorization engines will not always support any language beyond english. When they do, they will often have separate “linguistic analysis” processes that must run depending on the language involved. In cases such as this, it is important to identify if the language of data is fixed for each source, or if the data within a single source is a mix of languages. For example all data in this source is in german, while all data in a second source is in english — versus a mix of english and german within the same database. When this mix of languages is in place, search technologies must have a language identification capability that will let them identify the language of a given document, and then run the appropriate summarization and categorization tools for that language on that document. 7.4 Honoring the security of data sources Another important consideration for any search technology is its ability to honor the security of the data sources that are searched. This becomes particularly complex when document level security is enabled on the data source. There are two main areas to consider in terms of search security: security during indexing, and security during searching. Such “security” features of search technologies are often overlooked, but are a crucial product selection criteria. Honoring of data security during indexing Search technologies that create an index of document data must ensure that security details of documents are brought with the documents into the index as part of the documents metadata. This is the only way that the searching of this index can honor the original documents security. A key deployment consideration associated with this is ensuring that the credentials the indexing process uses to access the data source has access to all data that is desired to be indexed. 
This is again sometimes a complicated matter when document, or record, level security is enabled on the data. Honoring of data security during searching When the data involved in your search solution is of a sensitive nature, it is crucial that users are never returned results for documents to which they do not have access. For one, receiving a result set you cannot access is not a quality solution in terms of user satisfaction. But more importantly, the search results may contain document summaries or the contents of certain document fields, such that users would view in the search results sensitive information to which they would otherwise not have access. Thus, the searching capabilities of any search technology should be able to “impersonate” the user making the search request, and pass these user credentials on to the back-end data sources being searched. This will ensure that the only results received are results to which the user has access. A twist on this same scenario would be a back-end data source that contains highly personalized data. In this case, the ability of a search engine to impersonate the user making the request is important not so much for security reasons but for user satisfaction reasons. If the user is used to browsing information and viewing personalized information while browsing, they will probably expect a search capability over this same information to provide personalized results as well. 7.5 Source discovery When deploying any search technology, the initial setup of connecting the search engine to the data sources to be searched can be quite time consuming. This is especially so when extended search technologies are involved, and thus multiple diverse data sources may need to be included. In such cases it is helpful if the search product allows for the automatic “discovery” of details about data sources, and then automatically configures required settings, any field mappings, and other parameter information for each new data source. Any such “discovery processes” should also be able to ascertain whether or not a particular data source has been previously loaded into the search configuration — in which case the discoverer should skip already defined sources on subsequent invocations. This should be true even if the data source name changes. 7.6 Performance considerations Performance and scalability of any e-business solution can sometimes be considered a “black art”. There are any number of network, hardware, software, or environmental considerations that can affect an overall system's performance. However, when considering search technologies to utilize in a search solution, several key features or metrics of the search engines should be considered. Index size: Any search engine has a limit in the amount of data it can index, and the size of the index itself, before it will reach a poor level of performance. Additionally, the size of the index also needs to be considered to ensure that adequate disk resources are available. In many search technologies, the size of the index can be 50-75% of the size of the original data! Crawl rates: Search engines will also have limitations in terms of how quickly they can crawl through source data sets, and perform the necessary indexing and categorization steps. Obviously, the more advanced the summarization and categorization techniques applied, the slower the search engine's crawl rate will be.
Crawl rates are usually expressed in terms of documents per hour — but it is important to understand the average document size that manufacturers assume when quoting such statistics.
Caching: Caching of data is another common aspect considered in the performance of any IT system. When applied to search, caching can fall into several categories. First, are searches against a data source cached, such that additional searches for the same search term utilize the cached results? Such global search caching obviously needs to take into account the update period for the search indices, to ensure that cached search results do not begin providing out-of-date information. However, another equally important caching consideration relates specifically to the performance of the search client interfaces. If a user receives 1000 search results, then it is obviously not ideal to return all of these results directly to the client, in terms of ensuring the quickest response to the end user. Typically, only the first X results (10, 50, 100, and so on) are returned, and the rest are cached within the presentation module of the search engine. When the user then asks for the next "page" of search results (for example, results 50-100), the search engine simply serves this next page of results from cache. A simple sketch of such a paging cache appears at the end of this section.
Componentized architecture: Probably the most important aspect of any search technology, in terms of performance, is the flexibility and componentization of its overall architecture. For example, if the search engine has a relatively slow crawl rate, but has a componentized crawl engine that allows multiple "crawlers" to be executing at the same time, then the slow crawl rate of a single crawler may be acceptable. Common capabilities that should be available as separate "components" to aid in performance and scalability are:
– Crawling/indexing — This is necessary so that multiple data sources can be crawled and indexed at the same time.
– Search engine — Even if a single search engine is efficiently multi-threaded, supporting multiple instances of the search engine to handle user search requests increases the scalability of the implementation so that multiple searches can be processed at the same time.
– Client interface — The client interface should ideally be separate from the search engine. This allows the load of presentation and user formatting logic to be moved to hardware separate from the hardware utilized for the search engine itself. This also allows for better maintainability, in that the user interface can be modified without affecting the underlying search implementation.
In the case of an extended search capability, some additional key components would be:
– Search broker — Support for multiple search broker instances increases the scalability of the implementation in the same manner as multiple search engines do in a non-extended search product.
– Search connectors — Search connectors allow for communication between the search brokers and individual data source search engines, allowing modular plug-and-play of additional search sources. Support for multiple connectors to the same data source can further distribute the load of search requests.
For more details on these components, and ideal "runtime" architectures, please see Chapter 5, "Runtime patterns" on page 67.
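As promised above, here is a minimal sketch of a per-query result cache that keeps the full hit list on the server and returns one page of results at a time. The class and method names are illustrative only and are not taken from any particular search product; a production implementation would also need to expire cached entries when the underlying indices are refreshed.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a search result paging cache: the complete hit list for
// a query is cached server side, and clients fetch it one page at a time.
public class ResultPageCache {

    private final Map<String, List<String>> cache =
            new HashMap<String, List<String>>();

    // Store the full result list after a query has been executed.
    public void put(String queryKey, List<String> allHits) {
        cache.put(queryKey, allHits);
    }

    // Return hits [offset, offset + pageSize) for a previously run query.
    public List<String> getPage(String queryKey, int offset, int pageSize) {
        List<String> hits = cache.get(queryKey);
        if (hits == null) {
            // Cache miss: the caller would rerun the search and repopulate.
            return new ArrayList<String>();
        }
        int from = Math.min(Math.max(offset, 0), hits.size());
        int to = Math.min(from + pageSize, hits.size());
        return hits.subList(from, to);
    }
}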
7.7 Client features
When considering aspects of a good search client, the flexibility of the client to provide more than just a basic search result set is another important consideration. Next we offer some questions to ask when considering the capabilities of the clients provided with a search product.
What level of detail is available in results?
After a user performs a search, the more detail they are provided with in the search results, the easier it will be to determine the relevancy of any resulting documents found. For example, is just a basic document title included in the search result, or is a document summary also included? Even better, can specific fields from the document metadata be shown (creation date, author, and so on) to further refine and clarify the search results?
How are results from multiple sources handled?
When an extended search technology is involved, and thus searches span multiple data sources, how are the results from these multiple sources presented to the user? One option is for the results to be displayed broken down by data source. In other words, users would see all results for datasource1, then all results for datasource2, and so on. Another option, which provides additional advantages to the user, is to present a single consolidated view, with results aggregated and ranked into one list. However, such an aggregated solution does introduce additional complexities, such as how the differing ranking schemes of the various data sources are reconciled. For example, a user may search on the term "redbook". The search against the first data source may return many hits with a ranking of "80%", while a second data source may use different criteria in determining its rankings and label similar results as having an accuracy of 60%. When the results of these sources are aggregated into a single list, this difference in ranking must be accounted for. Thus, many extended search technologies support the assignment of "weights" to the rankings returned by a source. In our example, the results from the first data source would be given a lower weighting than those of the second data source, resulting in an aggregated list with comparable percentage rankings across the results. Of course, there may be situations in which support for both methods described here is required in a given solution! Therefore, the ideal solution would provide support for results both broken down by source and consolidated into a single list.
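To make the weighting approach described above concrete, the following sketch merges the results from two sources by scaling each source's raw ranking with a configured weight before sorting the combined list. The class, field, and weight values are illustrative only; real extended search products expose their own ranking and weighting mechanisms.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Minimal sketch of merging results from two sources whose raw rankings are
// not directly comparable: each score is multiplied by a per-source weight
// before the combined list is sorted in descending order.
public class WeightedResultMerge {

    public static class Hit {
        final String title;
        final double weightedScore;

        Hit(String title, double weightedScore) {
            this.title = title;
            this.weightedScore = weightedScore;
        }
    }

    // Each map holds document title -> ranking as reported by one source.
    public static List<Hit> merge(Map<String, Double> sourceA, double weightA,
                                  Map<String, Double> sourceB, double weightB) {
        List<Hit> merged = new ArrayList<Hit>();
        for (Map.Entry<String, Double> e : sourceA.entrySet()) {
            merged.add(new Hit(e.getKey(), e.getValue().doubleValue() * weightA));
        }
        for (Map.Entry<String, Double> e : sourceB.entrySet()) {
            merged.add(new Hit(e.getKey(), e.getValue().doubleValue() * weightB));
        }
        Collections.sort(merged, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.weightedScore, a.weightedScore);
            }
        });
        return merged;
    }
}

With the figures from the example above, the first data source might be assigned a weight of 0.75 so that its 80% rankings line up with the 60% rankings reported by the second source.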
Can the original document be "fetched" if needed?
After viewing a set of search results, and selecting a specific document to view in more detail, how does the user gain access to that original document? For Web-based search clients, users typically select a result item (identified by a URL), and the Web browser renders the content in accordance with the MIME type set for the document. For example, the browser might use Microsoft Word to render documents that have the file extension .doc. However, in some cases, such as with documents stored in file systems or in relational databases, the search technology will need to retrieve the requested document from the back-end data source — and then render it in some format that is usable by the user.
Does the search client support saved searches?
For searches that a user may need to repeat on a regular basis, the ability to provide for saved searches is important. This is especially true if the search technology allows for searches with an advanced search query syntax — such searches, if not saved, may be difficult for the user to recreate. Support for the saving of search queries can be extended even further to allow sharing of saved searches with other users, or scheduling of searches to run on a repeated basis — perhaps with the search results e-mailed to the user. Another aspect of saving searches to consider is the saving of search results. The result set may also need to be shared with other users, and providing a mechanism for users to save search results in multiple formats would allow for this reuse. Common formats for saved search results are PDF, Microsoft Word, or even XML. Without such capabilities, users may forward links to search queries around, which would rerun the search each time a user follows the link — potentially impacting the performance of the search server. There are obvious ways to handle these capabilities programmatically in any search client that is created. However, having saved-search capabilities supported within the back end of the search engine itself reduces the development effort and time for any search client.
Can result sets be dynamically used?
One final consideration for features in the search client is whether users have the ability to utilize the search results to perform further analysis. For example, can users search within the search results — thus further refining their search?
7.8 Client technologies
As we are considering portal-related search solutions in this redbook, the main capability required in terms of a client technology is support for standard Web-based capabilities that can be "surfaced" within the context of a portal — via a "portlet" within the portal. Specific guidelines on portlet development are provided in Chapter 8, "Application design" on page 133 later in this redbook. However, at a high level, a portlet is itself just a basic Web-based e-business application. Thus, any search engine should support multiple common Web technologies to provide the maximum flexibility in integrating the search technology into the portal. Some common Web technologies that should be supported are:
HTML, the basis of any e-business Web application
Dynamic HTML, JavaScript, and Java applets for enhancing the user experience on the client
Java servlets, JavaBeans, and JavaServer Pages (JSP) to provide for server-side processing and logic
When these technologies are put together, the common process for a Web-based e-business application is as follows: An HTML client interacts with a Web application server by using the HTTP protocol. The Web server processes the request via a server-side technology such as Java servlets or JavaServer Pages. Any search technology would ideally be integrated at this point, by providing strong support for standard J2EE components, such as JavaBeans and JSP tags, to easily access search capabilities from server-side logic. The server then returns a new HTML page to the client as a response to the original request, again responding via HTTP. The new HTML page can contain Java applets, JavaScript, or dynamic HTML (DHTML) for enhancing the presentation to the user. An alternative to the foregoing process would be using an Extensible Markup Language (XML) based "Web Services" solution to facilitate communication with the server.
In this model: Messages are packaged for sending to a server via structured XML, within a SOAP (Simple Object Access Protocol) “envelope”. The search service is located via the usage of Web Services Description Language (WSDL) and Universal Description, Discovery and Integration (UDDI) capabilities. The message is then sent to the Web service over HTTP, within a SOAP message. The search technology would implement this “Web service” and would then process the request, returning a response to client via another SOAP message. The client would then manipulate this response by parsing the XML, and using standard XML APIs and tools to present the data to the user. Ideally, any search technology chosen would support both models of development (that is, standard HTML/Java and SOAP/XML) for maximum solution flexibility. All of these technologies are well known for the development of any e-business application, and some guidelines for their usage are described within the rest of this section. 124 Patterns: Portal Search Custom Design 7.8.1 HTML HTML (HyperText Markup Language) is a document markup language with support for hyperlinks that is rendered by the browser. It includes tags for simple form controls. Many e-business applications are assembled strictly using HTML. This has the advantage that the client-side Web application can be a simple HTML browser, enabling a less capable client to execute an e-business application. The HTML specification defines user interface (UI) elements for text with various fonts and colors, lists, tables, images, and forms (text fields, buttons, checkboxes, and radio buttons). These elements are adequate to display the user interface for most applications. The disadvantage, however, is that these elements have a generic look and feel, and lack customization. As a result, some e-business application developers augment HTML with other user-interface technologies to enhance the visual experience, subject to maintaining access by the intended user base and compliance with company policy on Web client technologies. Because most Web browsers can display HTML V3.2, this is the lowest common denominator for building the client side of an application. To ensure compatibility, developers should be unit testing pages against a validator tool. Free tools, such as the W3C HTML Validation Service, are available at: http://validator.w3.org/ 7.8.2 Dynamic HTML DHTML allows a high degree of flexibility in designing and displaying a user interface. In particular, DHTML includes Cascading Style Sheets (CSS) that enable different fonts, margins, and line spacing for various parts of the display to be created. These elements can be accurately positioned using absolute coordinates. Another advantage of DHTML is that it increases the level of functionality of an HTML page through a document object model and event model. The document object enables scripting languages such as JavaScript to control parts of the HTML page. For example, text and images can be moved about the window, and hidden or shown, under the command of a script. Also, scripting can be used to change the color or image of a link when the mouse is moved over it, or to validate a text input field of a form without having to send it to the server. Chapter 7. Technology considerations 125 Unfortunately, there are several disadvantages when using DHTML. The greatest of these is that two different implementations (Netscape and Microsoft) exist and are found only on the more recent browser versions. 
A small, basic set of functionality is common to both, but differences appear in most areas. The significant difference is that Microsoft allows the content of the HTML page to be modified by using either JScript or VBScript, while Netscape allows the content to be manipulated (moved, hidden, shown) using JavaScript only. Due to varying levels of browser support, cross-browser design strategies must be used to ensure appropriate presentation and behavior of DHTML elements. In general, this technology is not recommended unless its features are needed to meet usability requirements. Additionally, DHTML is not supported within a “portlet” application. 7.8.3 JavaScript JavaScript is a cross-platform object-oriented scripting language. It has great utility in Web applications because of the browser and document objects that the language supports. Client-side JavaScript provides the capability to interact with HTML forms. You can use JavaScript to validate user input on the client and help improve the performance of your Web application by reducing the number of requests that flow over the network to the server. ECMA, a European standards body, has published a standard (ECMA-262) that is based on JavaScript (from Netscape) and JScript (from Microsoft), called ECMAScript. The ECMAScript standard defines a core set of objects for scripting in Web browsers. JavaScript and JScript implement a superset of ECMAScript. To address various client-side requirements, Netscape and Microsoft have extended their implementations of JavaScript in version 1.2 by adding new browser objects. Because Netscape's and Microsoft's extensions are different from each other, any script that uses JavaScript 1.2 extensions must detect the browser being used, and select the correct statements to run. One caveat is that users can disable JavaScript on the client browser, but this can be programmatically detected. 126 Patterns: Portal Search Custom Design 7.8.4 Java applets The most flexibility of the user interface (UI) technologies that can be run in a Web browser is offered by the Java applet. Java provides a rich set of UI elements that include an equivalent for each of the HTML UI elements. In addition, because Java is a programming language, an infinite set of UI elements can be built and used. There are many widget libraries available that offer common UI elements, such as tables, scrolling text, spreadsheets, editors, graphs, charts, and so on. You can use either the Java AWT or Swing classes to build a Java applet. But while designing your applet, you should keep in mind that Swing is supported only by later browser versions. A Java applet is a program written in Java that is downloaded from the Web server and run on the Web browser. The applet to be run is specified in the HTML page using an APPLET tag: <APPLET CODEBASE="/mydir" CODE="myapplet.class" width=400 height=100> <PARAM NAME="myParameter" VALUE="myValue"> </APPLET> For this example, a Java applet called “myapplet” will run. An effective way to send data to an applet is by using the PARAM tag. The applet has access to this parameter data and can easily use it as input to the display logic. Java can also request a new HTML page from the Web application server. This provides an equivalent function to the HTML FORM submit function. The advantage is that an applet can load a new HTML page based upon the obvious (a button being clicked) or the unique (the editing of a cell in a spreadsheet). 
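For illustration, here is a minimal sketch of what the "myapplet" class referenced in the APPLET tag above might look like. It simply reads the PARAM value supplied by the HTML page and paints it; the class and parameter names match the earlier tag example and are otherwise arbitrary.

import java.applet.Applet;
import java.awt.Graphics;

// A minimal applet that reads the "myParameter" PARAM value supplied by the
// enclosing HTML page and paints it in the applet area.
public class myapplet extends Applet {

    private String message;

    public void init() {
        // getParameter() returns the value of the matching <PARAM> tag,
        // or null if the page did not supply one.
        message = getParameter("myParameter");
        if (message == null) {
            message = "no parameter supplied";
        }
    }

    public void paint(Graphics g) {
        g.drawString(message, 20, 40);
    }
}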
A characteristic of Java applets is that they seldom consist of just one class file. On the contrary, a large applet may reference hundreds of class files. Making a request for each of these class files individually can tax any server and also tax the network capacity. However, packaging all of these class files into one file reduces the number of requests from hundreds to just one. This optimization is available in many Web browsers in the form of either a JAR file or a CAB file. Netscape and HotJava support JAR files simply by adding an ARCHIVE="myjarfile.jar" variable within the APPLET tag. Internet Explorer uses CAB files specified as an applet parameter within the APPLET tag. In all cases, executing an applet contained within a JAR/CAB file exhibits faster load times than individual class files. While Netscape and Internet Explorer use different APPLET tags to identify the packaged class files, a single HTML page containing both tags can be created to support both browsers. Each browser simply ignores the other's tag. Chapter 7. Technology considerations 127 JavaScript can be used to invoke methods on an applet using the SCRIPT tag in the applet’s HTML page. A disadvantage of using Java applets for UI generation is that the required version of Java must be supported by the Web browser. Thus, when using Java, the UI part of the application will dictate which browsers can be used for the client-side application. Note that the leading browsers support variants of the JDK 1.1 level of Java and have different security models for signed applets. Using Java plug-ins, you can extend the functionality of your browser to support a particular version of Java. Java plug-ins are part of the Java Runtime Environment (JRE) and they are installed when the JRE is installed on the computer. You can specify certain tags in your Web page, to use a particular JRE. This will download the particular JRE if it is not found on the local computer. This can be done in HTML through either of these tags: The conventional APPLET tag The OBJECT tag, instead of the APPLET tag, for Internet Explorer; or the EMBED tag with the APPLET tag for Netscape. A second disadvantage of Java applets is that any classes such as widgets and business logic that are not included as part of the Java support in the browser must be loaded from the Web server as they are needed. If these additional classes are large, the initialization of the applet may take from seconds to minutes, depending upon the speed of the connection to the Internet. Using HTTP tunneling, an applet can call back on the server without reloading the HTML page. For users who are behind a restrictive firewall, HTTP tunneling offers a bidirectional data connection to connect to a system outside the firewall. Because of the above shortcomings, the use of Java applets is not recommended in environments where mixed levels and brands of browsers are present. Small applets may be used in rare cases where HTML UI elements are insufficient to express the semantics of the client-side Web application user interface. If it is absolutely necessary to use an applet, care should be taken to include UI elements that are core Java classes whenever possible. 7.8.5 Java servlets Servlets are Java-based software components that can respond to HTTP requests with dynamically generated HTML. Servlets are more efficient than CGI for Web request processing, since they do not create a new process for each request. 
Servlets run within a Web container as defined by the J2EE model and therefore have access to the rich set of Java-based APIs and services. In this model, the HTTP request is invoked by a client such as a Web browser using the servlet URL. Parameters associated with the request are passed into the servlet via the HttpServletRequest, which maintains the data in the form of name/value pairs. Servlets maintain state across multiple requests by accessing the current HttpSession object, which is unique per client and remains available throughout the life of the client session. Acting as a "controller" component, a servlet delegates the requested tasks to beans that coordinate the execution of business logic. The results of the tasks are then forwarded to a "view" component, such as a JSP, to produce formatted output. One of the attractions of using servlets is that the API is a very accessible one for a Java programmer to master. The specification of the J2EE 1.3 platform requires Servlet API 2.3 for support of packaging and installation of Web applications. Servlets are a core technology in the Web application programming model. They are the recommended choice for implementing the logic that handles HTTP requests received from a Web client.
7.8.6 JavaServer Pages (JSPs)
JSPs were designed to simplify the process of creating Web pages by separating the Web presentation from the Web content. In the page construction logic of a Web application, the response sent to the client is often a combination of template data and dynamically generated data. In this situation, it is much easier to work with JSPs than to do everything with servlets. The JSP acts as the View component in a standard Model-View-Controller (MVC) programming model. The chief advantage JSPs have over standard Java servlets is that they are closer to the presentation medium. A JavaServer Page is developed as an HTML page. Once compiled, it runs as a servlet. JSPs can contain all the HTML tags that Web authors are familiar with. A JSP may contain fragments of Java code that encapsulate the logic that generates the content for the page. These code fragments may call out to beans to access reusable components and enterprise data. JSP technology uses XML-like tags and scriptlets written in the Java programming language to encapsulate the conditional logic that generates dynamic content for an HTML page. In the runtime environment, JSPs are compiled into servlets before being executed on the Web application server. Output is not limited to HTML but also includes WML, XML, cHTML, and DHTML. The JSP API for J2EE 1.3 is JSP 1.2. JSPs are the recommended choice for implementing the presentation (the view) that is sent back to the Web client. For those cases where the code required on the page would be a large percentage of the page, and the HTML minimal, writing a Java servlet will make the Java code much easier to read and maintain.
7.8.7 JavaBeans
JavaBeans is an architecture developed by Sun Microsystems, Inc. describing an API and a set of conventions for reusable, Java-based components. Code written to Sun's JavaBeans architecture is called JavaBeans, or just beans. One of the design criteria for the JavaBeans API was support for builder tools that can compose solutions that incorporate beans. Beans may be visual or non-visual. Beans are recommended for use in conjunction with servlets and JSPs. For example, the JavaServer Pages specification includes a set of tags for accessing JavaBeans properties.
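To tie together the servlet, JSP, and JavaBeans roles described in the preceding three sections, the following is a minimal sketch of a controller servlet that delegates a search request to a bean and forwards the results to a JSP view. It assumes only the standard J2EE servlet API; the SearchBean class and the results.jsp page are illustrative names, not part of any product.

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal sketch of the controller role: read the request, delegate to a
// bean (the model), and forward the results to a JSP (the view).
public class SearchControllerServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request,
                         HttpServletResponse response)
            throws ServletException, IOException {
        String query = request.getParameter("q");

        // The bean encapsulates the call to the underlying search engine.
        SearchBean bean = new SearchBean();
        List results = bean.search(query);

        // Make the model data available to the view, then forward to it.
        request.setAttribute("results", results);
        RequestDispatcher view = request.getRequestDispatcher("/results.jsp");
        view.forward(request, response);
    }

    // Stand-in model component; a real implementation would invoke the
    // chosen search engine's API here.
    public static class SearchBean {
        public List search(String query) {
            return Collections.singletonList("No search engine configured for: " + query);
        }
    }
}

The results.jsp view would then iterate over the "results" request attribute, for example with a <jsp:useBean> tag or a scriptlet, and render the hits as HTML.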
7.8.8 XML
XML (Extensible Markup Language) and XSL stylesheets can be used on the server side to encode content streams and parse them for different clients, thus enabling you to develop applications for a range of PC browsers and for the emerging pervasive devices. The content is in XML, and an XML parser is used to transform it to output streams based on XSL stylesheets that use CSS. This general capability is known as transcoding and is not limited to XML-based technology. The appropriate design decision here is how much control over the content transforms you need in your application. You will want to consider when it is appropriate to use this dynamic content generation and when there are advantages to having servlets or JSPs specific to certain device types. XML is also used as a means to specify the content of messages between servers, whether the two servers are within an enterprise or represent a business-to-business connection. The critical factor here is the agreement between parties on the message schema, which is specified as an XML DTD or Schema. An XML parser is used to extract specific content from the message stream. Your design will need to consider whether to use an event-based approach, for which the SAX API is appropriate, or to navigate the tree structure of the document using the DOM API.
7.8.9 Web Services
Web Services is the label placed on the latest variation of a service-oriented architecture (SOA). Basically, when someone now refers to a Web Services application, they are referring to an application or application component that makes use of the XML-based standards SOAP, UDDI, and WSDL:
SOAP (Simple Object Access Protocol) is one of the key standards that make up the foundation for Web services. SOAP is a lightweight XML-based protocol for the exchange of information in a decentralized, distributed environment. A SOAP message is composed of several parts: the envelope, the header, the body, and the actual payload. The entire XML message is wrapped within an envelope (<Soap:Envelope>…</Soap:Envelope>) clause. Headers are optional, but may contain routing and authentication types of content. These are found within the (<Soap:Header>…</Soap:Header>) clause. The body contains the actual payload wrapped in the (<Soap:Body>…</Soap:Body>) clause. A SOAP message can select from one of two transmission styles: a message can be transmitted in a "document" style or a "remote procedure call" style.
UDDI (Universal Description, Discovery and Integration) is an XML-based framework that enables businesses to discover each other, define how they interact, and share information in a global registry. Combined with SOAP, the UDDI initiative was created to facilitate the discovery of Web services over the Internet. The Web service developer registers the service definition and response specifications with the registry.
WSDL (Web Services Description Language) is an XML format for the description of network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. When an application "finds" an available Web service within a UDDI registry, it then interacts with that service using the information defining the service as described in the service's WSDL.
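As a simple illustration of the SOAP flow described above, the following sketch posts a hand-built SOAP envelope to a hypothetical search Web service over HTTP and returns the raw XML response. The endpoint URL, namespace, and operation name are assumptions made for the example; a real client would usually be generated from the service's WSDL and would parse the response with a standard XML API rather than returning it as a string.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of invoking a hypothetical "search" operation on a Web
// service by posting a SOAP 1.1 envelope over HTTP.
public class SoapSearchClient {

    public static String search(String endpoint, String term) throws IOException {
        String envelope =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
            + "<soap:Body>"
            + "<search xmlns=\"urn:example-search\"><term>" + term + "</term></search>"
            + "</soap:Body></soap:Envelope>";

        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setRequestProperty("SOAPAction", "urn:example-search#search");

        // Send the SOAP request.
        OutputStream out = conn.getOutputStream();
        out.write(envelope.getBytes("UTF-8"));
        out.close();

        // Read the SOAP response (left unparsed in this sketch).
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuffer response = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            response.append(line);
        }
        in.close();
        return response.toString();
    }
}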
There are many existing IBM Redbooks and other sources of documentation that provide more details about the usage and concepts behind Web services. However, here are a few important considerations for the usage of Web services:
Web Services-based architectures provide for interoperability across the two main development environments in existence today, that is, Microsoft's .NET initiative and the Java-based J2EE environment. An application written in .NET can leverage Web services written in J2EE, and vice versa.
Security is currently a concern in Web services environments. However, there are many standards in development to apply a level of security to SOAP messages. Support for these security standards should be investigated when a product supporting Web services is purchased.
7.9 Summary
This chapter has provided an overview of the key technological aspects that should be considered when selecting a search product to implement your search solution. Overall, the following key questions should come into play when making such a search technology decision:
What type of query syntax is supported?
Is a common data model supported for extended search capabilities?
Are capabilities provided for simple index creation, or for more advanced summarization and categorization?
Is the security of documents and data sources honored throughout the indexing and searching process?
Is the technology built on a componentized architecture that allows for deployment flexibility and ease in addressing any performance issues?
Does the product provide a diverse and flexible set of client choices, and/or APIs for creating custom clients?
By considering the answers to these questions, one should be able to select the best search product for any given environment.
Chapter 8. Application design
Portal application design presents some unique challenges compared to traditional application design and development. These challenges apply to any search solution built within a portal environment, as such solutions need to adhere to portal design rules and limitations. The majority of the challenges are related to the fact that traditional applications were primarily used by a defined set of internal users, whereas portal applications are used by a broad set of internal and external users such as employees, customers, and partners. This chapter attempts to define these challenges, and to provide some best practices for dealing with these challenges for any portal/portlet-based applications — including search solutions.
8.1 Introduction
In supporting a multitude of audiences, data from a wide variety of sources must be captured, managed, aggregated, and targeted to specific groups, while also customizing the display formatting of the information for various client device types. The following list provides key issues to consider when designing portal applications:
The user experience, look, and feel of the site need to be constantly enhanced to leverage emerging technologies, as well as to attract and retain site users. New features have to be constantly added to the site to meet customer demands. Such changes and enhancements will have to be delivered at record speed to avoid losing customers to the competition.
Portal applications in essence represent the corporate brand online.
Developers have to work closely with the marketing department to ensure the digital brand effectively represents the company image. Such intra-group interactions usually present content management challenges. It is hard to predict the runtime load of Portal applications. Based on the marketing of the site, the load can increase dramatically over time. If the load increases, the design must allow such applications to be deployed in various high volume configurations. Security requirements are significantly higher for Portal applications compared to traditional applications. In order to execute traditional applications from the Web, a special set of security-related software may be needed to access private networks. The emergence of the Personal Digital Assistant (PDA) market and broadband Internet market will require the same information to be presented in various user interface formats. PDAs and various other “pervasive” (mobile) devices will require a lightweight presentation style to accommodate the low network bandwidth. Broad-band users on the other hand will demand a highly interactive, rich graphical user interface. The presentation of the information in the portal must be logically separated from the business logic and datasources so that it can be changed as required. 134 Patterns: Portal Search Custom Design The domain of e-business application users is typically much more diverse than that of the user group for traditional applications. Users can be known to the systems, can remain anonymous, and can come from inside or outside the enterprise. Web accessible applications must be developed to meet the varied needs of those different end user types. The diversity in user types and exposure to the outside world significantly increases security risks to internal systems. The e-business application security infrastructure and applications must be designed accordingly, and will likely require dedicated security components. Content for e-business applications can come from many sources: technical and non-technical, from inside or outside the enterprise. Such diversity in location, skill levels, and access create significant challenges for content creation and management. To meet these challenges, extensibility, maintainability, and scalability are critical aspects in the design of Web applications. The following sections provide some suggestions for meeting the diverse challenges of e-business, especially in regard to using the Portal composite pattern. 8.2 WebSphere Portal Services architecture diagram IBM’s WebSphere Portal (WP) is essentially a Web application that runs within the IBM WebSphere Application Server (WAS) environment. This allows WebSphere Portal to take advantage of the core services in WebSphere Application Server for connectivity to various data sources and applications (for example, IBM’s Directory Server for LDAP). In addition, WebSphere Portal provides a Portlet API that allows the developer to create compact Java Web applications that “sit on top of” the WebSphere Portal application and thus have access to all the core services via simple to implement tag libraries and extended classes. This allows a developer creating end-user services to avoid having to “re-write” core connectors to WebSphere Application Server services each time they want to provide functionality. Chapter 8. 
Figure 8-1 WebSphere Portal Server 4.1 Component Architecture (component diagram showing the Portal Engine and Portlet API surrounded by device-specific aggregators, authentication and Credential Vault components, the portlet registry, content management integration, and the integrated local search and extended search services)
For example, WebSphere Portal can leverage the concept of Web Services by allowing a developer to create their own portlets, or use the existing Web services portlets with tag libraries, to enable this communication. They can avoid having to write their own SOAP-based wrappers and instead use the common wrappers available to WebSphere Portal via the core services in WebSphere Application Server.
8.2.1 Single-Tier versus Multi-Tier design
There are two perspectives in the application development and systems architecture realm. The first perspective describes keeping all functional capability on a single system and using a single "codebase". This creates a single point of failure and makes it more difficult to determine the root cause of functional or infrastructure-based system problems. In addition, it means that any functional modification will have an effect on other systems. There is no abstraction between the presentation, business logic, and data source tiers. The second perspective describes the separation of functionality and "functional concerns". This is exemplified by an architecture that has the business logic, data sources, transaction processing/management, presentation layer, and security mechanisms logically separated but working in concert to provide a single set of functionality. This identifies several possible points of failure, but also provides for easier problem determination when system difficulties occur. For example, if the user cannot log in and authenticate with the portal, then the first place to look is the authentication/security mechanism. This, in turn, includes the "rules" that govern how the portal responds to requests (including user type definitions, group definitions and their privileges), and the LDAP directory (containing user and group profile information and metadata). A multi-tier design is preferable because it allows a separation of concerns between the presentation, business logic, and datasource environments. It uses the application server, in conjunction with security, directory, and business rules mechanisms, to be the "integration hub" where data is aggregated.
The presentation layer leverages the aggregated data to provide end-user display and datasources are separated and accessed via a common set of connectors via the application server’s core services (for example, JDBC, JMS, JCA, or Web Services). 8.3 Portal solution guidelines The design of a portal system and architecture adopts many of the general tenants for e-business application design. In addition, using the Portal composite pattern as a guide, consider these guidelines when designing a portal implementation: The application server node is the central mechanism where data is “pulled” from multiple data sources. In a portal implementation it is sound architectural practice to provide a central, yet separated (loosely coupled) security mechanism that enhances maintainability and especially extensibility of the authentication and authorization mechanisms. New policies can be implemented without the need to modify other systems in the portal implementation. Use a component based architecture that isolates functionality (for example, business logic versus data sources versus security versus presentation) allowing the enhancement of specific areas of functionality with minimal affect on the whole system. This is often referred to as a “separation of concerns” and is similar to the MVC based application architecture. It makes sense to apply these same concepts of MVC based design to the larger environment where disparate applications and datasources are being “integrated”. Chapter 8. Application design 137 The use of a separated directory service allows for the upgrade of user and organization profile management without affecting the rest of the system. The portal system should be based on a set of functionally separated components leveraging common connectors for communication. The concept of workflow should be a single mechanism and can be used to manage both content “assets” that are contributed to the portal system and inter-application communication. An example of this management of communication between components is most evident when some type of queueing mechanism is used (for example, IBM’s WebSphere MQ) to provide guaranteed message communication between components. Content (including data from applications, databases, external datasources, and people) should be managed centrally. In some cases, it makes sense to pre-format the content before providing this content to the presentation or business logic (application server) mechanisms. Collaboration should leverage common security mechanisms and provide both asynchronous (for example, content management, discussion forums) and synchronous interaction (for example, instant messaging and “chat” facilities). The security of instant messaging is still relatively immature but is available in products such as Lotus’ Sametime. Leverage the single set of aggregated content and reformat for various device types. Sometimes, it seems most expedient to just duplicate the content and reformat for different device types. This places a maintainability burden on the system. It is more effective to use the same core set of aggregated information and apply end-user display templates (for example, using XSLT applied to XML formatted data) to format content. In this way, as new end-user display formats arise, more templates can be added while leaving the content storage and management mechanisms and processes relatively untouched. This saves time and money. 
Single Sign-On is important to provide seamless access to various datasources through a single interface. This provides for an "easier to use" user experience and allows for easier maintainability of user account information (through leveraging the concept of a central directory and authentication mechanism).
8.3.1 Model-View-Controller design
In the Model-View-Controller design shown in Figure 8-2, the Model represents the application object that implements the application data and business logic. The View is responsible for formatting the application results and dynamic page construction. The Controller is responsible for receiving the client request, invoking the appropriate business logic, and, based on the results, selecting the appropriate view to be presented to the user. A number of different types of skills and tools are required to implement the various parts of a Web application. For example, the skills and tools required to design an HTML page are vastly different from the skills and tools required to design and develop the business logic part of the application. In order to effectively leverage these scarce resources and to promote reuse, we recommend structuring Web applications to follow the Model-View-Controller design pattern:
The model represents enterprise data and the business rules that govern access to and updates of this data. Often the model serves as a software approximation of a real-world process, so simple real-world modeling techniques apply when defining the model.
A view renders the contents of a model. It accesses enterprise data through the model and specifies how that data should be presented. It is the view's responsibility to maintain consistency in its presentation when the model changes. This can be achieved by using a push model, where the view registers itself with the model for change notifications, or a pull model, where the view is responsible for calling the model when it needs to retrieve the most current data.
A controller translates interactions with the view into actions to be performed by the model. In a stand-alone GUI client, user interactions could be button clicks or menu selections, whereas in a Web application, they appear as GET and POST HTTP requests. The actions performed by the model include activating business processes or changing the state of the model. Based on the user interactions and the outcome of the model actions, the controller responds by selecting an appropriate view.
Figure 8-2 The Model-View-Controller design pattern (diagram showing the model, view, and controller roles and their interactions: state queries and change notifications between the model and the view, user gestures flowing from the view to the controller, view selection by the controller, and method invocations from the controller to the model)
MVC and Web Services
An MVC-architected application can leverage classes at the action level to invoke communication via SOAP to access Web Services. In fact, the action class can be used as the gateway for communication via other methods. Using this paradigm allows the application and its architecture to remain in its current form, and it can treat the Web Service as just another data source.
In Figure 8-3, an example of this is diagrammed. 140 Patterns: Portal Search Custom Design Figure 8-3 An MVC based application communicating with a Web Service via SOAP Also note that if you couple an MVC architected application with another business logic and data access layer following something like the Command-Manager design pattern, the access to Web Services can still be through the action class, or alternatively through a datasource level Manager class. While the Portal composite pattern is at the operational architecture level, the design of the application at the application architecture level also has impacts on which connector technologies can be used. In some cases, it makes sense to use Web Services and in other cases it makes sense to use JMS or JCA. Chapter 8. Application design 141 MVC and portlets Portlets must be capable of supporting display output to multiple Internet browsers running on many communications devices. These browsers typically require different display markup languages. Smaller devices have smaller display screens and typically have more limited handling of display markup. Portlets can support multiple browsers and device types by being implemented with the model-view-controller (MVC) design pattern. This design contains three entities: The model, the data source to be retrieved for the portlet: Model data for a portlet is typically retrieved from an external data source and loaded into Java display beans, or arrives formatted in an XML document. The view or views, the output mechanism used to display the data of the portlet: Display views are typically implemented as either JSP's, more typically used when the data model is implemented in Java beans, or XSLT style sheets when the incoming data is formatted in an XML document. The controller, which joins the selected view to the data and conducts the operation of the portlet: The controller selects the view for display based on the target device or browser, and then passes the data model to the view. The view extracts the specific display data, formats the data for the browser and renders its output to the browser as part of the portal aggregation of portlet outputs. For portlet development, the MVC pattern has the following characteristics: The portlet is only responsible for calling the right controller, depending on the markup supported by the client. Connectors are responsible for accessing content sources. Typically, there is one connector per content source type, for example, one connector for POP3 access and one for file-based cache. Models represent the content as retrieved through the connector. A model is independent of the presentation. Controllers are responsible for providing the appropriate markup (HTML, cHTML, or WML) for the content. In the MVC structure, there is a distinct separation of data from presentation along with a controller component for managing the interaction between the data (model) and the presentation or view. The controller knows the environment in which the application is invoked, gathers information from the data object to be displayed, and then applies the appropriate view to render the data using the markup language appropriate for the current device. 142 Patterns: Portal Search Custom Design A portal system is described as all of the business logic, user data, existing data sources (databases and applications), and supported end-user channels that contribute to an aggregated view of information targeted to specific user “groups” or “types”. 
The fundamental characteristics of a portal are:
Information aggregation
Targeted and personalized information
Managed content
Single sign-on
The characteristics and considerations in portal implementation are described in 2.3.7, "Portal characteristics" on page 28.
8.3.2 Content management guidelines
In a portal implementation, content management provides a type of collaboration and a mechanism and process for managing how information gets added to and removed from the portal system. A content management system should provide the following capabilities:
Workflow
Content creation, approval, deletion, formatting, and publishing
Access control lists
Content "asset level" and "edition level" versioning
The implementation of content management involves these guidelines:
1. Define the types of content: The content can be defined as "documents" or other binary data, or more discretely as "pieces" of content that are meant to be combined into a single view. This identification includes identifying the source, format, and update schedule of the content.
2. Define the location and/or source of the content: Knowing the location of existing content and the sources of new content is important. For example, if the content is a news feed (RSS formatted) from an external source, or if the content is being contributed by users from several departments, all of this will have to be taken into consideration when defining the types of users and their access to the content management process and mechanism.
3. Define the process by which the content will be contributed and eventually published: The contribution and publishing of content can have impacts on existing processes in the organization. Each group will likely have different methods for collecting and organizing the content they maintain. These processes will have to be analyzed to determine whether they need to be changed, adopted, or not used in light of other organizational and process changes that may be occurring. These decisions can have an impact on what functionality the portal will provide and how that functionality will leverage other systems. This analysis should take place before the architecture is completed. In addition, the publishing of content may be impacted by existing security policies in the organization, and these will have to be taken into account when determining how content will get from the content management system to the presentation mechanism (for eventual display to the portal user).
4. Define the versioning scheme for the content: Versioning allows the ability to revert to previous incarnations of the portal site. The content and/or the display templates can be versioned. Each content management package handles this differently and with varying success.
5. Define the expiration scheme for the content: The expiration of content is an important concept because it will drive changes on the production systems. It is important that whatever packages or technologies are used to enable this take into account how the deletion of content will affect other systems that may depend on that content. In a properly designed, loosely coupled architecture, these impacts are mitigated because the content management process only identifies what changes (or deletions) are necessary and allows other systems to handle the mechanics of the deletions. This also implies that before implementing content management, a complete understanding of the types of content and which systems depend on that content is needed.
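Guidelines 4 and 5 can be illustrated with a small sketch of a content asset that carries a version number and an optional expiration date, which a publishing step checks before sending the asset to the presentation tier. The field and class names here are purely illustrative; real content management products model assets, editions, and expiration in their own ways.

import java.util.Date;

// Minimal sketch of a versioned content asset with an expiration check.
public class ContentAsset {

    private final String id;
    private final int version;
    private final Date expires; // null means the asset never expires

    public ContentAsset(String id, int version, Date expires) {
        this.id = id;
        this.version = version;
        this.expires = expires;
    }

    // A publishing step would call this before pushing the asset to the
    // presentation mechanism; expired assets would instead be flagged for
    // removal so that dependent systems can react.
    public boolean isPublishable(Date now) {
        return expires == null || now.before(expires);
    }

    public String toString() {
        return id + " (version " + version + ")";
    }
}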
8.3.3 Single sign-on guidelines
From the user's perspective, single sign-on (SSO) is the ability to move between applications without being prompted for a userid and password (or certificate) when moving from one application or datasource to another. These applications and datasources can be on the same or different physical servers. Guidelines for implementing SSO are as follows:
Leverage a central user directory
Provide a single mechanism for intercepting user requests and passing security credentials to the various applications and datasources
Provide a session or authentication timeout for a user's logged-in session
Provide a variation of the SSO interface for various client device types
Identify the various user types and link them with the business rules engine
Enable encryption of the "data packets" as authentication is being performed
Update security policies to enhance the physical security of the SSO mechanism that intercepts authentication requests for existing applications and data sources
Agree on a common format for usernames and passwords
Agree on a process for updating SSO account information for usernames and passwords (for example, passwords cannot be retrieved if lost; they must be reset)
8.3.4 Collaboration guidelines
Collaboration is a vital part of a portal implementation. It helps organizations ease their transition through the process change brought on by the data consolidation efforts resulting from a portal implementation. In addition, it provides a mechanism for organizations to communicate with their target audiences so that they can understand the appropriateness and value of the information provided to these audiences. When implementing collaboration, it is important to understand these concepts and address them when designing, implementing, and deploying a portal system:
Standards-based approach: The realm of standards for collaboration (and "collaboration services") is very young and evolving. It would be worthwhile to develop applications based on specifications such as JMS, which would make the application vendor-independent and hence portable. However, in certain areas like instant messaging (JAIN, http://jcp.org/jsr/detail/165.jsp), the specifications are still in the community process and one may need to adopt a vendor-specific API. In such circumstances, the use of design patterns becomes paramount so that the collaborative modules are loosely coupled with the core application modules. Thus, the use of a loosely coupled architecture for your portal solution allows the use of "packaged" solutions that provide both synchronous and asynchronous interaction, and prepares the solution for incorporating a standards-based set of technologies in the future.
User Management: The major concern here is the coupling between the user directory (and/or discovery) service and the collaborative service. In most cases, the collaboration service (CS) provider can be configured to exclusively use its own or another directory service. This type of setup is ideal since it provides a single point of control and extensibility. However, in some cases, the CS provider uses its own directory service. In such a scenario, the directory services might require frequent replication/synchronization. This, of course, implies that the two directory services are either based on the same standards, or that a synchronization tool takes care of the conversion. This is not a best-practices setup. Ideally, the collaboration service should leverage the central directory where users and groups are described.
Security: Collaboration essentially involves peer communication, and hence each client is more powerful than in a server-based pattern. It becomes essential from the systems and application perspective to provide a domain of activity for a peer. Most CS providers allow peer management through access control lists that can be managed through a central directory/discovery server. The CS server uses these ACLs to allow services to peers. The authentication and authorization procedures are specific to the API and to the application requirements. However, most CS providers, such as Lotus Sametime, provide support for standards such as SSL and SOCKS proxies through their APIs. These can be utilized, in conjunction with the user profile/ACL-based security, to provide both user authentication and user authorization.
8.3.5 Web services guidelines
Web services constitute a distributed computing architecture made up of different computers communicating over a network to form one system. At the current time there are two competing application paradigms being put forward: one by Sun Microsystems and the other by Microsoft. The Sun model is called the Sun Open Net Environment (Sun ONE), an open framework that supports "smart" Web services, and in which the J2EE platform plays a fundamental role. The Microsoft application paradigm is called .NET. In fact, using Web services, either .NET or Sun ONE services can be accessed by a Web services requester. However, there are still issues and problems when communicating between J2EE-based and .NET-based applications. Thus, the first best practice that we would suggest is to avoid mixing these application paradigms if at all possible. Having said that, keep in mind that many .NET services are being successfully accessed by J2EE-based applications, and vice versa. In this section, we focus on best practices for Web services development and deployment within a J2EE environment, that is, Web services that are built using servlets, JSP pages, the EJB architecture, and all the other standards that are part of the J2EE technology.
Apply distributed computing principles
Think of Web services as another technology for developing distributed systems. All of the best-practice principles used in developing distributed systems apply to Web services. All of the considerations that would go into any enterprise systems design apply to Web services, such as high availability, high throughput, clustering, hardware management, and network topology. The main difference between most distributed systems and Web services is that Web services are newer. Most Web services software is less than a year old. So as a rule, there is not the same level of reliability, security, or performance that you would find with other distributed systems software that has been around longer. Another factor is that Web services are built on a set of technologies (SOAP, XML, WSDL, UDDI) that are still evolving and are being evolved by separate standards organizations and vendors in parallel. It will be some time before all these standards are able to converge (especially given the Sun versus Microsoft debate). Because of the lack of a solid set of standards, implementation details are left to individuals.
Still, some common principles can be adopted as best practices at this time: Design systems that are layered: This is the same principle that you would apply to any distributed, component architecture. It is especially important in Web services applications where we do not have control over some components (services) that we access in our application. Design course-grained Web services: Web services have all the same issues as those of distributed systems when it comes to requesting a remote service. Requesting a service from a machine over the network is more expensive than a local operation. With this in mind, keep the request as coarse grained as possible when requesting a Web service from a remote machine. Existing Java beans or EJBs with fine-grained methods or operations should be aggregated into a single coarse-grained Web services, wherever possible. This technique avoids unnecessary network traffic and overhead on the communication stack. This also makes it possible to push the transaction integrity requirements to the Web services provider making for a cleaner design. In other words, if a coarse-grained request did not successfully complete, then the Web service provider can roll back that entire transaction. Design for “loosely coupled” components: A Web service by definition is an interface to a loosely coupled component on a remote system. Therefore, it is very important to be cognizant of the impact of integrating loosely coupled components. With this in mind, define clear contracts between layers and services, but utilize the “Parameter List” paradigm where possible. Chapter 8. Application design 147 Limit dependency on other components: Managing dependencies is one of the key challenges in utilizing Web services in an intranet or extranet scenario. Common dependencies that occur in an application design are: – Call flow dependency: Business processes implemented by systems are not typically within the domain of one business component. – Object association dependency: Using object-oriented techniques, it is easy to model a business problem by associating objects together. However, from an implementation perspective, doing so increases the linkage from one component to another. Use interfaces where possible. Implement all cross "domain" business processes in a "control" or "workflow" layer. The flexibility of an application is increased if all business processes that cross multiple business domains are implemented in a workflow layer. In doing so, the application architecture has more flexibility in what is called, when it is called, managing the call (such as exception handling), and performing any translation on the data that is passed in or out. For more detailed guidelines on Web Services, refer to Patterns: Self-Service: Connecting to the Enterprise, SG24-6572. Web services and Microsoft’s .NET Using IBM WebSphere Studio Application Developer, it is possible to create and run a Web service and to invoke the Web service from client applications created using the Microsoft .NET Framework SDK. Using the Web services wizard within Application Developer, you can generate WSDL proxies for consumption by these Microsoft .NET clients in both the C# and JScript programming languages, which are both currently supported by the Microsoft .NET Framework. It is even possible to combine proxies created in both languages into single executables. 
For details on how to create these integrated applications, see the WebSphere Developer Domain article Developing Microsoft .NET Web Service Clients for EJB Web Services with IBM WebSphere Studio Application Developer and the Microsoft .NET Framework SDK: http://www7b.software.ibm.com/wsdd/techjournal/0204_wosnick/wosnick.html?open& l=851,t=gr Note: C#, pronounced "C sharp," is a new Microsoft programming language similar to Java and C/C++. 148 Patterns: Portal Search Custom Design 8.4 Summary It is important that the portal design be based on a loosely coupled architecture. This provides for the separation of components so that they can be replaced, enhanced, or removed with little effect on other systems. In addition, the use of common connectors is vital to linking the components together into single virtual system. Connection technologies such as JMS, JCA, and Web Services are methods that many vendors agree with and support allowing the portal implementation that uses these technologies to bring together the best of breed. 8.5 Where to find more information In this section we list some information sources for reference: For information on the IBM Patterns for e-business: http://www.ibm.com/developerworks/patterns Web Services Security (WS-Security): http://www-106.ibm.com/developerworks/webservices/library/ws-secure/ Web services architecture using MVC style: http://www-106.ibm.com/developerworks/webservices/library/ Developing Web Services: http://dcb.sun.com/practices/howtos/developing_webserv.jsp Model-View-Controller design pattern — SUN Microsystems: http://java.sun.com/blueprints/patterns/MVC-detailed.html Chapter 8. Application design 149 150 Patterns: Portal Search Custom Design Part 4 Part 4 Technical scenario © Copyright IBM Corp. 2004. All rights reserved. 151 152 Patterns: Portal Search Custom Design 9 Chapter 9. “Chrisco Books” scenario To illustrate the Portal Search custom design, this chapter describes a sample intranet portal scenario for a fictitious technical book publisher called “Chrisco Books”. The example described is a demonstration of the need for a single portal-based search capability across Web sites (HTML), file systems, content management systems, and Lotus Domino databases. Although this example scenario was created for a publishing company, the same solution should meet the needs of most companies where unstructured textual information is contained in various disparate repositories, and where multiple search technologies have already been deployed in point solutions. © Copyright IBM Corp. 2004. All rights reserved. 153 9.1 Chrisco Books scenario: story line Chrisco Books writes and publishes technical journals and books for the IT community. The company consists of subject matter experts, technical writers and editors, and customer service representatives. The company has recently deployed an employee portal providing employees the ability to collaborate in team workplaces, and access human resource information and industry related news feeds. The company would like to extend the portal implementation in order to provide employees the ability to locate all the details about any given book project — be it a completed project, legacy “paper” based project, an in-progress project, or even related materials and knowledge available on the internet. 
In their current environment, the company maintains various data repositories, including: File systems, where content creators and editors work on journals/books in progress An intranet site, for published drafts and final work products A Lotus Domino based project tracking application, where all information pertaining to the status a particular project is maintained A content management system, containing legacy books and journals that have been converted to an image format for electronic storage, as well as other image files for new book projects In some of the cases, the data repositories have existing “point” search solutions that were implemented over the years to allow for basic searching. However, any user must locate the specific search capability, then query each of these multiple search engines to find the need information, and then cross-reference the results to determine what has been found. Overall, the company has determined that the second phase of their portal rollout should provide the following business benefits: 1. Help streamline current business activities, by improving organizational efficiency and reducing the latency of business events. When this desire is applied to search capabilities, one can see that these efficiency improvements will ultimately come from: a. Distilling meaningful information from a vast amount of structured and unstructured data b. Providing easier access to vast amounts of unstructured data through indexing, categorization, and other advanced forms of summarization. 154 Patterns: Portal Search Custom Design 2. Chrisco Books also has a desire to reduce their IT/technology costs by: a. Reducing spending on maintenance and training of legacy system interfaces. b. Reducing deployment and implementation costs for any new systems. 9.2 Chrisco Books scenario: requirements In any technology effort, one of the most important first steps is to be sure you understand what it is you are trying to build or implement — and its relationship to the real business problem at hand. This section describes the requirements for our Chrisco Books scenario. 9.2.1 Functional requirements In any fictitious scenario, such as our book publishers search needs, we have full reign over the functionality being provided. Rather than bore you with elaborate Unified Modeling Language (UML) use cases, that go into an exhaustive listing of pre-conditions, post-conditions, and alternate path scenarios, we are taking the abbreviated approach of providing very brief one or two sentence descriptions of the functionality available to the various users in our scenario. For this scenario, we define several key user roles: Authors and Editors: Are ultimately responsible for the writing and editing of complete technical documents, based on input from Subject Matter Experts. This includes validating the source and accuracy of information provided by such Subject Matter Experts. Subject Matter Experts: Are responsible for researching and drafting key portions of the technical content, as contribution to the efforts of the authors and editors. SMEs should not have access to project management information, as this includes information regarding payment of SMEs, etc. Customer Service Representative: Are responsible for handling questions originating from customer calls or instant messages, by pointing them to the available books, and discussing upcoming book availability. They have other billing and service responsibilities that are outside the scope of this effort. 
- Administrators: Are responsible for maintaining the portal, search engines, and back-end applications.
- Customers: Are interested in buying and reading the books produced, as well as understanding what upcoming books are planned.

All users should be able to save their favorite search queries, so that they may more easily access them in the future.

The user scenarios for the portal search users are shown in Figure 9-1.

Note: These scenarios only cover those related to searching from within the portal framework, and do not include user scenarios related to the overall portal solution.

Figure 9-1 User scenarios for new portal search functionality (use cases in the Portal Search System: Search Published Books; Search Project Mgmt/Status Data; Search Work in Progress book files; Utilize saved searches and personalization; Search internet and other Web knowledge; Administer system. Actors: Customer, Author/Editor, Customer Service Rep, SME, Administrator.)

9.2.2 Non-functional requirements

In addition to the business requirements, it is important to clearly articulate the non-functional requirements associated with this business need as well. This includes information such as:

- Target repositories: What data is being searched? The physical location, data formats, supported queries, and platform.
- Information retrieved: What information should be returned? Documents, people (that is, expertise).
- Metrics: Number of concurrent searches, response times.

As with user scenario generation, the creation of full non-functional requirements is beyond the scope of this book, and is not truly needed for such a fictitious scenario. However, in this section we briefly describe some non-functional aspects.

Target repositories

As described earlier, Chrisco Books makes use of various repositories. These repositories are located within the US and are based on various content platforms. However, the user population is spread out across the globe. Thus, a Web-based solution for accessing these repositories is required from a non-functional standpoint, as a thin-client approach is the only realistic option for providing acceptable performance on such a distributed basis.

File system

This repository contains Adobe FrameMaker, Microsoft Word, and IBM Lotus SmartSuite® files used by writers and editors. The file system used in our scenario is a shared Windows 2000 Server drive, but it could be any generally accepted file services alternative, such as Novell NetWare or even Samba on Linux. "Legacy" search capabilities are provided for this repository in the existing environment via the standard Windows operating system file search capabilities.

Intranet site

This repository contains published books and other documents in Adobe PDF and/or HTML file formats. Books are presented in a user interface that includes summary information about the file, including an abstract, table of contents, and list of authors. "Legacy" search capabilities are provided for this repository via an existing Web search engine. Users are presented with a browser-based search interface, which allows searches via simple Boolean logic.

Project tracking application

This repository includes project management documents and information maintained in an IBM Lotus Domino application.
The information included consists of client satisfaction surveys, timelines for the book effort, details on book contents, and a list of the people involved in the book's creation (subject matter experts, authors, editors, researchers, and so on). "Legacy" search capabilities are provided for this repository via the built-in indexing and searching of the IBM Lotus Notes/Domino technology.

Content management

This repository contains older books and journals in various image formats that would otherwise be unavailable electronically. Additionally, other graphics/image files needed for current projects are also included, all stored within IBM Content Manager. Metadata has been associated with the files to facilitate the search and retrieval of information. Search capabilities are provided by the IBM Content Manager Windows client.

Figure 9-2 depicts these multiple data repositories and search solutions as they exist today.

Figure 9-2 Existing data repositories, with current search capabilities (the WIP file system (Win2k) is searched via the built-in OS file search and its OS file services index; the project tracking Notes application via Notes indexing and the Notes client; the published books site (HTML, PDF) via a legacy Web crawler, HTML index, and custom HTML search interface; and the content management system (IBM CM) via Content Manager indexing and the Content Manager client.)

9.2.3 Summary of requirements

After considering both the business and non-functional requirements, the desires of Chrisco Books can be summarized as follows: this "phase 2" portal solution should provide users the ability to search the various repositories from a single Web interface, which is tightly integrated into the new portal "workplace" to which their employees are growing accustomed. Results from the multiple repositories should be combined and normalized into one user results list. Additionally, the appropriate content should be delivered in this results list based on the employee's role and access.

9.3 Patterns mapping

Now that the business requirements and key technology drivers have been clearly identified, it is time to investigate the best practices available within the industry for such a solution. This is done through the application of, or "mapping" to, the Patterns for e-business.

9.3.1 Examining the business requirements

As we begin to examine the various search-oriented application patterns, it is easy to see that the existing environment described earlier in this chapter basically maps to the Application Integration::Population: Index Population and Information Aggregation::User Search and Discovery application patterns we discussed in Chapter 4. This mapping of the patterns onto the current environment is shown in Figure 9-3.
Figure 9-3 Existing environment, mapped to application patterns (each existing repository, the WIP file system, the project tracking Notes application, the published books site, and the content management system, has its own Index Population application and its own User Search and Discovery interface.)

However, this existing environment consists of multiple search interfaces and their supporting applications, in a manner that does not provide a single business solution, and it is not working for Chrisco Books today. Therefore, we must next consider the key business requirements specified in our scenario that are needed to improve the situation:

- Improve organizational efficiency.
- Reduce the latency of business events.
- Distill meaningful information from a vast amount of structured and unstructured data.
- Provide easier access to vast amounts of unstructured data through indexing, categorization, and other advanced forms of summarization.

While the existing environment may be helping to distill meaningful information, it is clearly not improving organizational efficiency or reducing the latency of business events, as the existing environment only provides benefits at the workgroup/individual task level. Chrisco Books needs these benefits to be provided at an enterprise level, so that true organizational efficiency can be achieved.

9.3.2 Solution options

There can be multiple patterns-based solutions to any business problem. Therefore, it is important to analyze each possible option to determine the best fit for Chrisco Books.

Option 1: Single Index

Based on the analysis so far, one alternative for meeting Chrisco Books' needs would be to implement the already identified Information Aggregation and Application Integration patterns (Index Population and User Search and Discovery) that meet the business requirements, on a single enterprise-wide basis, as depicted in Figure 9-4.

Figure 9-4 Solution 1: Single Index (a single crawler/indexer populates one search index covering the WIP file system, project tracking, published books, and content management repositories; users query that index through a single search portlet.)

Note: The Application patterns used here are discussed in detail within Section 4.2, "Application Integration patterns" on page 45 and Section 4.3, "Information Aggregation patterns" on page 57.

While such a single index solution would meet the business requirements for Chrisco Books, it might also have a large technology/implementation cost associated with it. For example, the data storage requirements for the creation of a single index encompassing all of the data sources might be substantial. Additionally, it may be difficult to find an "off-the-shelf" product with the ability to talk to all of the platforms Chrisco utilizes for data collection and indexing, and thus substantial development effort could be required to build such a single index via custom code. The question is: are these IT impacts acceptable to Chrisco Books?
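Conceptually, the Index Population side of this single-index option is a crawler that pulls documents from every repository into one shared index, which a single search interface then queries. The following minimal Java sketch illustrates that idea only; the Repository and Doc types and the toy in-memory index are assumptions made for this example and do not represent any IBM product API.

```java
import java.util.*;

// Illustrative sketch of Option 1: one crawler, one shared index for all repositories.
interface Repository {
    String name();
    List<Doc> fetchDocuments();   // a real crawler would fetch incrementally
}

class Doc {
    final String id, title, body;
    Doc(String id, String title, String body) { this.id = id; this.title = title; this.body = body; }
}

class SingleIndex {
    // term -> document ids: a toy in-memory inverted index
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();
    private final Map<String, Doc> docs = new HashMap<String, Doc>();

    void add(Doc d) {
        docs.put(d.id, d);
        for (String term : (d.title + " " + d.body).toLowerCase().split("\\W+")) {
            if (term.length() == 0) continue;
            Set<String> ids = postings.get(term);
            if (ids == null) { ids = new HashSet<String>(); postings.put(term, ids); }
            ids.add(d.id);
        }
    }

    List<Doc> search(String term) {
        List<Doc> hits = new ArrayList<Doc>();
        Set<String> ids = postings.get(term.toLowerCase());
        if (ids != null) for (String id : ids) hits.add(docs.get(id));
        return hits;
    }
}

class Crawler {
    // Population step: every repository feeds the same index.
    static SingleIndex populate(List<Repository> repositories) {
        SingleIndex index = new SingleIndex();
        for (Repository r : repositories) {
            for (Doc d : r.fetchDocuments()) index.add(d);
        }
        return index;
    }
}
```

The shape of the sketch makes the cost driver above visible: every repository's content, in every format, must flow through this one population path and be stored in the one index.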
Option 2: Search brokering/federated search

By analyzing the first solution option, we have found that looking at the business requirements alone is not enough; we must also revisit the IT drivers that Chrisco Books identified:

- Reduce spending on maintenance and training of legacy system interfaces.
- Reduce deployment and implementation costs for any new systems.

When combining these IT drivers with the business requirements, it is clear that the Search Adapter/Search Service variation of the Information Aggregation::User Search and Discovery application pattern maps to these IT drivers. A solution based on the inclusion of this variation of the User Search and Discovery pattern is shown in Figure 9-5.

Figure 9-5 Solution 2: Extended search solution (a Search Broker/Federator, implementing the Search Adapter/Search Service variation of User Search and Discovery, sits behind the search portlet and uses connectors to query the existing point search solutions, the OS file search, Notes indexing, the legacy Web crawler index, and Content Manager indexing, through their APIs. The existing Index Population applications remain in place.)

Note: The additional Application pattern variation used in this solution is discussed in more detail within "User Search and Discovery application pattern" on page 61.

This solution meets the same business drivers as the "Single Index" solution, as it includes the same Information Aggregation and Population patterns. The addition of the extended search federation/brokering technology, via the Search Adapter/Search Service variation of the User Search and Discovery application pattern, then meets the cost reduction IT drivers specified by Chrisco Books. In this solution, the User Search and Discovery based application is the only real new capability that must be developed and deployed. The existing population-based applications are left intact, with search connectors interfacing with these existing capabilities.

Of the two solution options presented, this "extended search" option would probably have the lowest deployment and overall IT costs, as it provides for a large amount of reuse. Additionally, the maintenance costs associated with the solution should also be minimized, as any of the data repositories can be pulled out and replaced without requiring modification to the entire solution. In the worst case, only one of the "connectors" would need to be modified or replaced. Based on these IT cost savings, this second option is clearly the correct choice for Chrisco Books' needs.

9.3.3 Integrating the solution

For most of this chapter we have focused rather directly on the search needs and requirements of Chrisco Books. However, it is important to highlight the integration of this solution into the overall portal environment, and into the Portal composite pattern/Portal Search custom design. Since the chosen solution delivers these search capabilities via a portlet, full integration into the context of the portal is guaranteed. Thus, these Information Aggregation and Application Integration capabilities will be available right alongside the other Self-Service, Collaboration, and Access Integration capabilities of the full portal solution.
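To make the brokered approach equally concrete, the sketch below shows a hypothetical Java search broker that fans a query out to one connector per existing point search solution and merges the ranked results into a single normalized hit list. The SearchConnector and Hit types are illustrative assumptions for this example only; they are not the Lotus Extended Search API.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical connector: one implementation per existing point search solution
// (file system search, Notes full-text search, legacy Web crawler index, Content Manager).
interface SearchConnector {
    String sourceName();
    List<Hit> search(String query, int maxHits) throws Exception;
}

// A normalized result; the score is assumed to be rescaled to 0..1 by each connector.
class Hit {
    final String source, title, url;
    final double score;
    Hit(String source, String title, String url, double score) {
        this.source = source; this.title = title; this.url = url; this.score = score;
    }
}

class SearchBroker {
    private final List<SearchConnector> connectors;
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    SearchBroker(List<SearchConnector> connectors) { this.connectors = connectors; }

    // Fan the query out to every connector in parallel, then merge and rank the hits.
    List<Hit> federatedSearch(String query, int maxHits) throws InterruptedException {
        List<Callable<List<Hit>>> tasks = new ArrayList<Callable<List<Hit>>>();
        for (SearchConnector c : connectors) {
            tasks.add(() -> c.search(query, maxHits));
        }
        List<Hit> merged = new ArrayList<Hit>();
        for (Future<List<Hit>> f : pool.invokeAll(tasks, 10, TimeUnit.SECONDS)) {
            try {
                merged.addAll(f.get());          // skip sources that failed or timed out
            } catch (ExecutionException | CancellationException ignored) { }
        }
        merged.sort((a, b) -> Double.compare(b.score, a.score));   // aggregate ranking
        return merged.subList(0, Math.min(maxHits, merged.size()));
    }
}
```

The cost argument made above is visible in the shape of this code: adding or replacing a repository means writing or swapping one connector implementation, while the broker, the portlet, and the existing indexes are untouched.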
9.4 Expanding the scenario So far in this chapter we have described a business problem, that being Chrisco Books search needs, which is probably a bit more simplistic than a real world problem. However, this scenario can be easily built upon with additional functionality, by adding additional patterns and concepts as needed. Here are a few examples of how this scenario can be expanded: The content of the portal itself could be included as one of the repositories that is searched. The addition of additional sources is fully supported by the proposed solution, as one of the existing Index Population applications could crawl and index this additional data source, or another connector would simply be needed to allow the User Information Access based application to “extend” its search to this additional source. Alternatively, additional sources could be integrated by introducing the Application Interation:Federation application pattern as discussed in 4.2.4, “Federation application pattern” on page 55. In this case, the additional data sources would be unified behind a federation tier, that would then portray the “image” of a single data source to the User Information Access application. 164 Patterns: Portal Search Custom Design More advanced search capabilities could be added, such as person and expertise identification (that is, tacit knowledge), taxonomy and categorization, etc., could be added to this scenario. In this case the same application patterns would apply, as the “Search, Discovery, and Indexing” tier of the Application Integration::Population: Index Population application pattern supports these advanced capabilities. Such advanced search capabilities could also be applied in a more “dynamic” fashion, by allowing them to be performed in a second step. That is, users would perform a search and then optionally choose to categorize the search results into a taxonomy for easier analysis. Other enterprise systems, such as the CRM systems utilized by the Customer Service Representatives, could leverage the same search capabilities and interface. Customer Service representatives would then have one location to both research, and look up details about a given customer’s interaction with Chrisco Books. The search capabilities could also be expanded to include a more “context aware” search, that would better leverage the portal environment in which these search capabilities are deployed. For example, capabilities could be include such that any text or phrase could be searched on via a simple right mouse click. The sources searched would be determined by the current location or portlet being used within the portal. This would be an example of adding the Application Integration::Direct Connection pattern to the solution. This pattern is described in more detail on the patterns Web site: http://www-106.ibm.com/developerworks/patterns/application/at1-runtime.html 9.5 Summary In this chapter, we introduced our Chrisco Books scenario by describing the overall business context, including the business and non-functional requirements. We then examined these requirements, to identify the Application patterns that clearly address the problem at hand. Finally, we analyzed the various options for implementing the identified patterns, and discussed some ways in which this scenario could be expanded further. In the next chapter, we will detail how we actually implemented the various Application patterns identified in our solution, by leveraging IBM technologies. Chapter 9. 
Chapter 10. Technical implementation of the scenario

In this chapter we provide some details on the technical implementation of our Chrisco Books scenario. Screen-by-screen installation instructions are not provided for all products; rather, we attempt to provide some understanding of the key architectural decisions and product mappings made to implement the scenario within the Redbooks test lab.

10.1 The runtime environment

Prior to actually implementing anything, we must take the proposed solution formulated in the previous chapter and map it to the runtime and product levels. Figure 10-1 provides a quick review of the solution we are implementing, viewed at the Application patterns level.

Figure 10-1 The Chrisco Books solution that we have chosen to implement (the extended search solution of Figure 9-5: a Search Broker/Federator with connectors to the existing repositories and their Index Population applications, fronted by a search portlet.)

When we map this to the Runtime patterns defined in Chapter 5, "Runtime patterns" on page 67, and then remove the firewall and other infrastructure aspects that are not needed in our lab environment, we are left with the runtime "picture" of our scenario infrastructure shown in Figure 10-2.

Figure 10-2 Chrisco Books scenario — runtime environment (runtime nodes: Presentation Server, Personalization Server, Application Server, Database Server, Directory and Security Services, Search, Federation, Population, Collaboration, and Content Management.)

Using this model runtime environment as a guideline, we then choose the appropriate products for each of these key runtime nodes. However, we know that Chrisco Books already has some existing technologies in place in our scenario, specifically:

- A WebSphere Portal implementation
- An IBM Content Manager based content management system
- A Lotus Notes/Domino based collaboration system that also provides some project tracking and content management capabilities

These pre-existing technologies make many of our product mapping decisions for us, leaving only the "search" node as the main product choice required. We choose Lotus Extended Search (LES) as the product to provide the search node capabilities, as the scenario solution requires a product that provides the "extended search" variation of the Information Aggregation::User Information Access application pattern. To allow LES to access the IBM Content Manager product, the IBM Enterprise Information Portal (now known as IBM Information Integrator for Content) is also required.

Based on these product selections, the products in our scenario environment map to the runtime nodes as follows:

- WebSphere Portal Experience V4.12 (including WebSphere Application Server v4) = Personalization Server, Presentation Server, Application Server
- IBM DB2 v7.1 = Database Server
- IBM Content Manager v7.1 = Content Management, Population
- Lotus Domino v5.10 = Collaboration, Content Management, Population
- Lotus Extended Search v3.7 = Search
- IBM Enterprise Information Portal v7 = Federation
Technical implementation of the scenario 169 Finally, we choose to utilize the built in LDAP capabilities of Lotus Domino to provide Directory and Security Services — resulting in a deployment of four physical nodes (servers) in our environment, as depicted in Figure 10-3. Domino Server Population Collaboration Directory and Security Services Federation Personalization Server Database Server Search Extended Search Server Presentation Server Content Management Application Server Portal Server Content Manager Server Figure 10-3 Chrisco Books scenario — physical nodes All products were installed on IBM series servers, running Windows 2000 Server (service pack 3) as the base operating system. 10.2 The Lotus Domino server As the scenario has Chrisco Books with an existing Domino environment, this server was installed first. We started with a generically installed Lotus Domino 5.10 server, and then made the following configuration changes to support the WebSphere Portal and other features required for this scenario: The LDAP task was loaded and set up to act as the central authentication directory for the scenario. This involved setting up the wpsadmin/wpsbind users, and wpsadmins group, with the appropriate access rights as required for a WebSphere Portal 4.1 installation. A “redbooks.nsf” database was created, populated with PDF files of redbooks, and added to the server. This database included a browser accessible search interface, to simulate the published books HTML site required for this scenario. 170 Patterns: Portal Search Custom Design A “projects.nsf” project tracking Notes database was created and added to the server to allow for the searching of Notes based data as required by the scenario. Note: Both of these databases were based on IBM ITSO organization applications, and are thus unavailable for download. However, these databases represented basic Lotus Notes/Domino applications. Any standard Notes client based, workflow enabled, database can be utilized in place of the projects.nsf we used in this scenario, and any standard browser client accessed database with search capabilities enabled can be utilized in place of the redbooks.nsf used in this scenario. 10.3 The IBM Content Manager server The next server installed was the IBM Content Manager server, as this was another of the technologies that Chrisco Books was supposed to have had in their environment prior to our scenario. Software for the CM server was installed in the following order — using the default installation values: IBM DB2 7.1 EE IBM WebSphere Application Server 3.02 Microsoft Visual C++ (Visual C++ is required to compile the database access libraries; the installation will fail if it is not installed.) Content Manager v7.1 IBM Enterprise Information Portal v7.1 Once the installshield installation processes were finished, the Content Manager database was loaded with sample PDF files for simulating the older books and images used by Chrisco books in our scenario. The PDF files utilized for this were simply IBM Redbook PDF files. This content was defined and loaded into the Content Manager database, and then set up for searching through the Enterprise Information Portal, via the steps given in the following sections. Define the metadata Here are the steps to follow: 1. Load the Content Manager Administration Client (login as frnadmin/password in our scenario). 2. Expand the LIBSRVRN tree, expand Fileroom Chapter 10. Technical implementation of the scenario 171 3. 
Create the following key fields to define the content metadata (Figure 10-4): a. b. c. d. e. f. g. h. i. j. k. l. m. BOOK_ABSTRACT (VARCHAR) BOOK_AUTHOR1_FIRST (VARCHAR) BOOK_AUTHOR1_LAST (VARCHAR) BOOK_AUTHOR2_FIRST (VARCHAR) BOOK_AUTHOR2_LAST (VARCHAR) BOOK_AUTHOR2_FIRST (VARCHAR) BOOK_AUTHOR2_LAST (VARCHAR) BOOK_ISBN (VARCHAR) BOOK_PAGES (INT) BOOK_KEYWORDS (VARCHAR) BOOK_PUBLISH_DATE (DATE) BOOK_PUBLISHER (VARCHAR) BOOK_TITLE (VARCHAR) Figure 10-4 Create Key Fields Create the search index and search template Here are the steps: 4. Create a new index class called BOOK to enable indexing of this content. 5. Assign the key fields created in step 3 to the BOOK index class (Figure 10-5). 172 Patterns: Portal Search Custom Design Figure 10-5 Assign key fields 6. Load the Information Integrator for Content Administration Client login as cmbadmin/password 7. Create a new federated entity BOOK with new federated attributes (Figure 10-6). a. b. c. d. e. f. g. h. i. j. k. l. m. ABSTRACT (VARCHAR) AUTHOR1_FIRST (VARCHAR) AUTHOR1_LAST (VARCHAR) AUTHOR2_FIRST (VARCHAR) AUTHOR2_LAST (VARCHAR) AUTHOR2_FIRST (VARCHAR) AUTHOR2_LAST (VARCHAR) ISBN (VARCHAR) PAGES (LONG) KEYWORDS (VARCHAR) PUBLISH_DATE (DATE) PUBLISHER (VARCHAR) TITLE (VARCHAR) Chapter 10. Technical implementation of the scenario 173 Figure 10-6 Create Federated Entity BOOK 8. From the federated entity properties dialog, click Map Federated Entity to map the key fields created in step 3 to the federated attributes created in step 7 (Figure 10-7). 174 Patterns: Portal Search Custom Design Figure 10-7 Map Federated Entity 9. Create a new search template called BooksAuthLast (Figure 10-8). 10.Add the following template criteria; add all operators for each item: a. Name: Auth1Last - Attribute AUTHOR1_LAST - Default Operator: like b. Name: Title - Attribute TITLE - Default Operator: like c. Name: ISBN - Attribute ISBN - Default Operator: equals Chapter 10. Technical implementation of the scenario 175 Figure 10-8 Create Search Template Import the PDF files Here are the steps: 11.Load Content Manager client for Windows; login as frnadmin/password. 12.Open the work baskets list; choose To Be Indexed. 13.From the file menu, choose Import. 14.Choose PDF file type. 15.Click Browse and choose file, then click Import. 16.Repeat for all files to be imported, then click Close (Figure 10-9). 176 Patterns: Portal Search Custom Design Figure 10-9 Import PDF Files 17.For each PDF imported and listed in the “To Be Indexed” work basket: a. b. c. d. Double-click the document in the work basket. Press CTRL-I to index the document. Choose the BOOK index class and fill in all the book meta-data attributes. Click OK to save. 10.4 The Lotus Extended Search server At this point, the Lotus Domino and IBM Content Manager servers have been installed, and are configured and ready to act as our key data and application servers for the scenario. Thus, it is now time to install the Lotus Extended Search server that will be brokering search requests from portal users, to the data and application servers, and then aggregating the results. Chapter 10. Technical implementation of the scenario 177 To start, Lotus Extended Search requires DB2 for its configuration database information, and IBM WebSphere Application Server for its administrative interfaces. So, DB2 7.2 FixPak 7 and WebSphere Application Server 4.04 were first installed on this server prior to starting the installation of Lotus Extended Search 3.7 (LES). 
A full version of LES was then installed on the server using the RMI option for communication between the various Extended Search components. Note: During the installation of WebSphere Application server, one has to decide how handle communications between the Extended Search server components. The two options available are a plain Remote Method Invocation (RMI) approach, and an Enterprise JavaBeans (EJB) approach. The main difference is that the RMI approach uses the Java Remote Method Protocol (JRMP) for communications, while an EJB approach uses the Internet Inter-ORB Protocol (IIOP). IIOP can be problematic for corporate firewalls and other security standards, and thus RMI is provided as a more universally supported option. When EJB communications are used, EJB support built in to WebSphere Application Server allows for support of the communication. However, if RMI is chosen, then the Extended Search RMI server must be started separately from the WebSphere server to allow communication to take place. If your extended search server is set for EJB, and EJB communication fails, then the servlets/ES server/etc will try to use RMI as a backup. However, in order for the Domino and Content Manager based data and applications to be searchable, LES components must be installed on those servers as well. The LES broker components on the main LES server communicate to LES components installed on Domino and Content Manager servers, so the LES components on these servers can access these servers product APIs to perform the searches. In the case of the Content Manager server, a “server only” install of LES was executed, as Content Manager requires a local LES broker to interact with the Enterprise Information Portal capabilities that then search Content Manager. On the Domino server, just an LES agent was installed. A “server only” install of extended search is performed using so_setup.exe from the Lotus Extended Search CDs. 178 Patterns: Portal Search Custom Design Note: Agents must be running on the same machine where either the Domino server or the Notes client software is installed (the Notes API permits remote access). We choose to install the LES agents on our Domino server, but alternatively could have installed a Notes client directly on our LES server. Please see Appendix B, “Understanding the Lotus Extended Search architecture” on page 207 for more detailed discussions of the various agent requirements and considerations. After all of the Extended Search components had been installed, LES “data sources” were created to allow for searching of the key data components required for the Chrisco Books scenario. These key search data sources for the Chrisco Scenario included: 1. A public internet search site, provided by google.com in our implementation of the scenario 2. An intranet site, provided by the redbooks.nsf database setup on the Domino server in our implementation of the scenario 3. A project tracking application, provided by the projects.nsf database setup on the Notes server in our implementation of the scenario 4. Data from a full content management system, provided by the PDF files loaded into our IBM Content Manager server in our implementation of the scenario Each of these data sources were then setup on the LES infrastructure as described in the rest of this section. 10.4.1 Internet and Intranet data source setup Both the Google Internet source, and the intranet source, are set up via standard LES Web sources. 
LES is actually preconfigured with data source definitions for many popular internet sites, including Google — so the Google Web source was already enabled. However, to make a Web source definition for our intranet site, we were required to create a Web source definition file for this new Web source. The first step in creating a Web source definition file is to contact Intelligent Algorithms Enterprises, Ltd., to obtain an infoGIST toolkit that enables you to build your own Web source definition file (.sbb file). http://www.infogist.com/lotus.htm Chapter 10. Technical implementation of the scenario 179 In order to create the SBB file, the InfoGist SBB Authoring toolkit needs to be obtained from Infogist and installed on a development machine. After doing this, we used the steps in the following section to create our custom Web source: Step 1: Create the SBB file Here are the steps to follow: 1. Identify the search page URL for the Web source to include. In our scenario, we utilized the redbooks.nsf URL on the Domino data server. 2. Launch the InfoGist SBB Authoring tool. Figure 10-10 Sample InfoGist Authoring tool user interface (with existing search targets) 180 Patterns: Portal Search Custom Design 3. Select the menu item SearchBot, Add. 4. Click OK when prompted for the searchBot ID. 5. Enter the SearchBot definition as shown in Figure 10-10. Figure 10-11 SearchBot Definition Chapter 10. Technical implementation of the scenario 181 6. Click the SearchBot Forms button and insert a new form. The details of the form are shown in Figure 10-12. Figure 10-12 Search Form Definition 7. When complete, click OK until you return to the searchBot definition page. 182 Patterns: Portal Search Custom Design 8. Define the Follow/Skip rules by clicking the Follow/Skip button and entering the information specified in Figure 10-13. Figure 10-13 Follow/Skip Rule definition Chapter 10. Technical implementation of the scenario 183 9. Next, specify the search syntax by clicking the Search Syntax button and completing it as shown in Figure 10-14. Figure 10-14 Search Syntax configuration 10.Validate the configuration by performing a search from within the SBB Authoring tool. To initiate the search, enter the search parameter and select the Click Here To Search button located to the right of the search input field. Figure 10-15 shows the results of an example search. 184 Patterns: Portal Search Custom Design Figure 10-15 Search execution within the InfoGist toolkit Step 2: Deploy the SBB file to the LES Server In order to deploy the SBB file to the LES server, copy the file to the server’s base directory (for example, <drive>:\Program Files\IBM\Extended Search) Step 3: Discover the Web Sources Through the ES administrative interface, follow these steps: 1. Click Servers in the administration interface navigator. 2. Discover the new data source, by right-clicking the server icon (where the SBB file was deployed) and selecting Discover Data Sources. Refer to Figure 10-16. Chapter 10. Technical implementation of the scenario 185 Figure 10-16 Discovering a data source in Lotus Extended Search Refer to Figure 10-17 for the following sequence: 3. Enter the name of the data source. 4. Click the Start Discovery button. 5. Select the source from the list generated/discovered. 6. Click the Add to ES button. 186 Patterns: Portal Search Custom Design Figure 10-17 Configuring the Web Source discoverer Step 4: Configure the Web Sources Link Here are the steps to follow: 7. Select the Links option on the navigator. 8. 
Right-click the Web Sources link and select Properties (Figure 10-18). Chapter 10. Technical implementation of the scenario 187 Figure 10-18 Update the Web Source Link Refer to Figure 10-19 for the following sequence: 9. Click the second tab (Parameters) and enter the name of the discovered SBB file in the ESWebConfig link parameter value column. Make sure to separate the different files with a “?” as shown in the diagram below. 10.Click the Apply button. 188 Patterns: Portal Search Custom Design Figure 10-19 Configure the ESWebConfig Parameter 11.Finally, propagate the changes, and restart the LES server. At this point, the new Web source based on the custom SBB definition file should be available. 10.4.2 Domino application data source setup Domino data sources are easily set up within Lotus Extended Search. To create the Domino Data source for our projects.nsf Notes application, we performed the following steps: 1. In the ES Admin applet, choose the primary server. Right-click and choose Discover Data Sources. 2. Choose Lotus Notes for the type of source to discover. 3. Enter the Domino server name and the domino hostname. Note: Include the http port number if using a port other than 80. 4. If the Domino server is not on the same physical machine as Extended Search server, uncheck the box labeled, “Is this Domino server located on the local host?” 5. Uncheck “Load these databases with a skeleton set of fields.” 6. Choose Lotus Notes 5.0 for the link name. 7. Click Start Discovery. The LES server will then communicate with the remote LES “agent” (as installed earlier on the Domino server) and bring back a list of all available databases on the server to search. 8. Choose the database from the list and click to add this source to Extended search. Chapter 10. Technical implementation of the scenario 189 10.4.3 IBM Content Manager data source setup As discussed earlier, Content Manager is actually accessed by LES through the Enterprise Information Portal (Information Integrator for Content). An LES broker was installed on the Content Manager/EIP server so that LES can communicate with EIP. EIP in turn is configured to search Content Manager via its own federated search capabilities. Thus, to configure LES to search the IBM Content Manager in our scenario, we first need to configure the connection to the LES broker installed on the Content Manager server, and then set up the EIP data sources in LES. To configure the connection between the LES server, and LES components on the Content Manager server, the following steps are performed. This steps must be performed prior to starting the LES components installed on the Content Manager server: 1. Ensure that the LES server is running, and open the Extended Search Administration applet. 2. Create a new Extended Search server in the Admin applet (Figure 10-20). Figure 10-20 Creating a new extended search server 3. Connect the Primary Extended Search server, installed on our LES node, to the Extended Search server installed on the Content Manager/EIP server (Figure 10-21). 190 Patterns: Portal Search Custom Design Figure 10-21 Connecting the LES servers 4. Enable basic authentication on Extended Search servers via the Admin applet (Figure 10-22). Figure 10-22 Enabling basic authentication between the servers 5. Map user IDs from within the EIP Admin application: a. From within the EIP Administration client, create a user to be used by the LES server as it accesses Content Manager through EIP. 
Note: For a production system, each LES user must be mapped to an EIP user in this manner. This limitation is removed in Extended Search 4.0. We utilized Extended Search 3.7 in our scenario. Chapter 10. Technical implementation of the scenario 191 b. Check the box labeled “Allow access from Extended Search” and enter the user name which you created in Step a. (see Figure 10-23). Figure 10-23 Enabling LES access for Information Integrator for Content users c. Close the EIP Administration client. 6. Rename Agents & Brokers on EIP Extended Search Server. a. In the Extended search admin applet, open the properties for the LES server installed on the Content Manager/EIP server. b. Rename the Broker and Agents to identify them as EIP/Content Manager specific (Figure 10-24). 192 Patterns: Portal Search Custom Design Figure 10-24 Server Properties 7. Start EIP based Extended Search Server. At this point, the LES components running on the Content Manager server have been properly configured and started. The EIP data source can then be created, as follows: 1. In the Extended Search Admin applet, choose the EIP server, right-click and choose Discover Data Sources. 2. Select IBM Enterprise Information Portal as the type of source to discover. 3. Enter the content manager database, “cmbdb” in our scenario. 4. Enter the user id for this database, “cmbadmin” in our scenario. 5. Enter the password. 6. Click Start Discovery. Chapter 10. Technical implementation of the scenario 193 7. Choose the BOOK federated search object that was set up in EIP during the Content Manager server install, and choose Add to ES (Figure 10-25). Figure 10-25 Information Integrator for Content discovery setup Creating an LES application After all the data sources have been setup, the final step is to create an Extended Search “application” that associates all these data sources into a single “federated” search. 1. In the ES Admin applet, choose Applications in the left pane, right-click and choose New. 2. In the dialog enter an application name and description. 3. Choose Broker as the entry broker (Figure 10-26). 194 Patterns: Portal Search Custom Design Figure 10-26 Application Properties 4. Click OK to close the dialog. 5. Add Data Sources to the new search application: a. In the Extended Search Admin applet, choose Applications in the left pane. b. In the right pane is a list of Categories. Expand the section of the tree labeled, “[All Categories]”. c. Right-click Domino Sources, and choose Copy. d. Right-click the Books application and choose Paste. The Domino data sources are added to the Books application. e. Do the same for the EIP and Intranet data sources. Chapter 10. Technical implementation of the scenario 195 10.5 The WebSphere Portal server Finally, the WebSphere Portal server was installed in our scenario environment to pull all of the other technologies together into a seamless interface. Since the scenario calls for the usage of Lotus Extended Search and IBM Enterprise Information Portal, WebSphere Portal v4.1 “experience” was required. During the installation “Setup Manager” process, the portal was configured with the following options: – Standard Install – Install Components: • • • • WebSphere Portal WebSphere Personalization WebSphere Application Server IBM HTTP Server – Use Local DB2 Database – Database & LDAP Directory – Use Domino LDAP Once the WebSphere Portal was installed and up and running, a custom portal theme was applied, and search functionality was added to a new page group called "Search”. 
To create the search page group, the following steps were performed (Figure 10-27): 1. Log in as a portal administrator. 2. Click the Work With Pages link. 3. Click Manage Places and Pages, then Create Place. 4. Enter the place name Search, choose the ITSOSearch theme, and click OK. 5. Select the Search place and click Manage Place. 6. Choose Create Page, then Create New. 7. Enter the page name Search, and choose a layout. Click OK. 8. In the Manage Places portlet, click Done. 196 Patterns: Portal Search Custom Design Figure 10-27 Create the portal theme Next, the Search page group was customized, and the out-of-the-box Advanced Extended search portlet was added as our search interface. Note: We could have created a custom search portlet leveraging the Lotus Extended Search API for our scenario. However, we choose to utilize the out-of-the-box portlet to simplify the scenario. 1. Log in as a portal administrator. 2. Click Work With Pages. 3. Click Edit Layout and Content. 4. In the drop-down list, choose the Search place and Search page. 5. Click Get Portlets. 6. In the “Name Contains” field, enter Search and click Go. 7. In the search results, click the plus symbol next to “Extended Search Advanced Portlet”. 8. Click OK. Chapter 10. Technical implementation of the scenario 197 9. Choose Extended Search Advanced Portlet. 10.Click the plus symbol in the page layout frame to add the portlet to the page. 11.Click Activate. 12.Switch to the new Search page. 13.Click the configure icon for the Extended Search portlet, and enter the correct hostname for the Extended Search server. 14.Configure the search portlet (Figure 10-28): a. Log into the portal as wpsadmin. b. Go to the Search page. c. Click the EDIT button for the Advanced Search portlet. d. Click Change Application name or Server URL. Figure 10-28 Advanced Extended Search portlet configuration options 198 Patterns: Portal Search Custom Design e. Enter books for the application name, as this is the LES “application” we set up earlier with our scenarios data sources defined. f. Edit the URL, and change localhost to the address of the Extended Search server. Do not change the rest of the URL (Figure 10-29). Figure 10-29 Setting the LES application name g. Click OK, then Done. 10.6 Putting it all together At this point, all of the servers are installed and configured, and the scenario can be verified by accessing the WebSphere Portal, and performing a search within the Extended Search portlet. Overall, the Lotus Extended Search server was the real “workhorse” of this Portal search solution, as it brokers the search requests out to the other search engines built into Content Manager (that is, Information Integrator for Content/EIP) and Lotus Domino. The results are returned to Extended Search server where they are aggregated, ranked, and sorted — and then returned to the portlet as a single hit list of results from all data sources. Chapter 10. Technical implementation of the scenario 199 The out-of-the-box portlet we utilized in our scenario provides a basic interface for the user to do a search, as shown in Figure 10-30. Figure 10-30 Basic portlet search UI The portlet also allows the user to specify which sources to search, by choosing the “advanced” option, as shown in Figure 10-31. By default, all sources are searched. Figure 10-31 Selecting the sources 200 Patterns: Portal Search Custom Design The search results are returned to the portlet in an aggregated and ranked fashion, as shown in Figure 10-32. 
These results show hits from both Google and the internal redbooks.nsf search site aggregated into a single list. However, the search results interface provided by this out-of-the-box portlet is not the most intuitive. Any real-world deployment of this scenario would likely use a custom portlet with a more user-friendly view of the search results.

Figure 10-32 Example search results

Part 5 Appendices

Appendix A. Pattern changes

With the publication of this redbook, several changes have occurred to the Application Integration and Information Aggregation application patterns. Some patterns have been renamed, some have been discontinued, and new patterns have been introduced. In general, these changes were made to more clearly represent the "data focused application integration" capabilities provided by some application patterns that were previously considered part of the Information Aggregation business pattern, by moving these patterns to the more accurate category of Application Integration patterns. In other words, data-based application integration is more "integration" focused than "business" focused, and thus more correctly belongs within an Integration pattern. Additionally, in the process of making these changes, the names of some Application patterns were also modified to better identify their capabilities.

To help clarify these changes, Table A-1 provides a mapping of each new application pattern name to the older application pattern name used in previous Patterns for e-business IBM Redbooks.

Table A-1 Information Aggregation and Application Integration pattern changes (old pattern name → new pattern name):
- Information Aggregation::Information Access → Information Aggregation::User Information Access (UIA)
- Information Aggregation::Information Aggregation plus Limited/Extended Update → Information Aggregation::User Information Access (UIA) - Immediate update variation, and Information Aggregation::User Information Access (UIA) - Batched update variation
- No prior name (new pattern) → Information Aggregation::User Search and Discovery (US&D)
- Application Integration::Federated Repository → Application Integration::Federation
- Information Aggregation::Population Single Step → Application Integration::Population: Single Step
- Information Aggregation::Population Multi-step → Application Integration::Population: Multi-step variation
- No prior name (new variation) → Application Integration::Population: Data Cleansing variation
- Information Aggregation::Replication → Application Integration::Population: Synchronization
- Information Aggregation::Population Crawl and Discovery, and Information Aggregation::Population Summarization → Application Integration::Population: Index Population

Appendix B. Understanding the Lotus Extended Search architecture

This appendix provides some background details on the Lotus Extended Search architecture that should be helpful to any IT professional deploying this technology as part of a Portal Search Custom Design. Overall, the distributed component architecture of Extended Search offers the flexibility to scale a system according to changing requirements.
It also allows the Extended Search components to be arranged in a topology that matches any environment, enabling a blend of IBM AIX, Sun Solaris, Windows 2000, and Windows NT platforms as needed. The architecture supports vertical and horizontal scalability:

- Vertically, within a single Extended Search server, you can configure multiple instances of the server processes to influence the number of simultaneous requests that the server can process.
- Horizontally, with multiple machines, you can set up additional Extended Search servers and additional Web servers. For each Extended Search server, you can determine the types of server tasks you want to run. By having multiple servers, you can distribute and balance the processing load.

Extended Search architecture

The Extended Search system employs a four-tiered architecture (Figure B-1). Messages start from search applications in the first tier and proceed consecutively through the subsequent tiers to the back-end. In most cases, the back-end is a third-party data source to which Extended Search is connected, but it can also be the Extended Search configuration database (CDB), a private back-end that is managed by DB2.

Figure B-1 Extended Search tiered architecture (browser, Notes client, and applet clients in the first tier; Web and application servers in the second tier; Extended Search brokers, agents, links, and discoverers in the third tier; back-end data sources and the CDB in the fourth tier)

Message flows between the tiers can be divided into two basic categories: run time messages, shown above the dotted line in the diagram, are usually issued by the user community to perform searches and retrieve documents; administrative messages, shown below the dotted line, are issued by the administrator and result in updates to the configuration database. Run time messages can be submitted either through a standard Web browser or a Lotus Notes client program. Administrative messages are always submitted through the Extended Search Administration interface.

The horizontal bars in Figure B-1 indicate the consecutive components through which each message must flow during its journey from the first tier through the fourth tier and back again. Each of these components is described in the following sections, starting from the right side of the diagram and moving to the left.

Links and translators

Extended Search links are the software modules that encapsulate the native API calls for search and retrieval to a specific data management system. They contain all of the required data structures, programming objects, and procedural logic necessary to interface with the back-end data system. A link module is uniquely assembled to support (at a minimum) four callable methods that typically exist in all data management systems: methods to connect to and disconnect from the host system, and methods to search content and retrieve data from the system. The link module performs a null operation for those methods that are not supported by the back-end source; for example, a file system search does not support the concept of connecting and disconnecting.

Extended Search translators are the software modules responsible for translating the incoming GQL expression into the native search grammar of the back-end data system.
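Although the Extended Search toolkit defines its own link and translator interfaces (they are not reproduced in this redbook), the contract just described can be made concrete with a small, purely hypothetical Java sketch. The interface and class names below are invented for illustration only; they show the four callable methods of a link module, the single responsibility of a translator, and the "null operation" behavior for a back-end, such as a file system, that has no connection concept.

Example: Hypothetical link and translator contracts

   import java.util.List;

   /**
    * Hypothetical shape of an Extended Search link module. The real toolkit
    * defines its own interfaces; these names exist only to make the four
    * callable methods described above concrete.
    */
   interface LinkModule {
       void connect(String hostName, String userId, String password) throws SearchException;
       void disconnect() throws SearchException;
       List search(String nativeQuery, int maxHits) throws SearchException;   // returns hit descriptors
       byte[] fetch(String documentId) throws SearchException;                // retrieves one document
   }

   /** Hypothetical translator contract: GQL in, native search grammar out. */
   interface TranslatorModule {
       String translate(String gqlExpression) throws SearchException;
   }

   /** Checked exception used by the sketches above. */
   class SearchException extends Exception {
       SearchException(String message) { super(message); }
   }

   /**
    * A file system has no notion of a session, so connect and disconnect are
    * the "null operations" mentioned above.
    */
   class FileSystemLink implements LinkModule {
       public void connect(String hostName, String userId, String password) { /* no-op */ }
       public void disconnect() { /* no-op */ }
       public List search(String nativeQuery, int maxHits) throws SearchException {
           // Would walk the configured directories, match nativeQuery, and return at most maxHits entries.
           throw new SearchException("sketch only - directory walk not implemented");
       }
       public byte[] fetch(String documentId) throws SearchException {
           throw new SearchException("sketch only - file read not implemented");
       }
   }

A custom link developed with the toolkit would follow the same general shape: the agent loads it, asks the translator for a native query, and then drives the connect, search, fetch, and disconnect calls.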
Like links, translators contain all of the required data structures, programming objects, and parsing logic necessary to generate a syntactically correct search expression. In some cases, the same translator module applies to several different back-end systems, as is the case for the SQL translator and the many varied systems that support the standard SQL grammar.

Extended Search comes with a broad set of link and translator modules that enable you to connect to most of the industry's common data management systems. If your data system is not contained in this standard set, you can develop a custom link or translator module by using a toolkit provided with the product.

Agents

Extended Search agents are programs that respond to search and retrieval operations targeted against a particular data source. The agent loads the appropriate link and translator modules when a request against a specific data source type is first made. The agent then calls upon these module libraries for translation (XLAT), connect, disconnect, search, and retrieval operations. Figure B-2 illustrates the interaction of the agent with a given back-end system.

Figure B-2 Extended Search agents (each agent loads the XLAT and link libraries for its link type, issues connect, search/fetch, and disconnect calls against the back-end search engine or data source, and returns the response sorted by rank and pruned to the maximum number of hits)

For search operations, an agent will sort the result set by relevance rank and then truncate the set to the maximum number of hits specified in the original search request. This sorting and subsequent pruning of the list of hits is an important precursor to aggregation, which is discussed shortly.

Agents can reside on the same machine as the data source (recommended) or use a data source's remote APIs for access. More than one copy of an agent can run on a single computer to handle concurrent search and retrieval requests. An agent can be dedicated to a single data source, a group of sources of a particular type, or a range of sources that have a mixture of link requirements.

To be able to discover and search certain types of data sources, the Extended Search Server component, including an agent, must be installed on the same machine as the data source being discovered or searched. Note that you are not required to install the Web server and configuration database components on a remote machine, but you may need to install the base server software to ensure that an agent is locally available to the remote target sources. Table B-1 details the requirements for accessing remote data sources and identifies those products (for which support is predefined in Extended Search) that require a local agent for searching.

Table B-1 Agent location requirements

File systems
  Agents must be running on the same machine as the directories being searched.

IBM Enterprise Information Portal
  Agents must be running on the same machine as the Information Integrator for Content federated server.

LDAP
  Agents can search any LDAP server from any Extended Search machine (LDAP is completely remote).

Lotus Connectors
  Agents must be running on the same machine as the Domino server that hosts the Connectors software.
Lotus Domain Index
  Agents must be running on the same machine where either the Domino server or the Notes client software is installed (the Notes API permits remote access).

Lotus Domino.Doc
  Agents must be running on the same machine where the Domino.Doc Desktop Enabler is installed (the Domino.Doc COM API permits remote access).

Lotus Notes
  Agents must be running on the same machine where either the Domino server or the Notes client software is installed (the Notes API permits remote access).

Microsoft Access
  Agents must be running on the same machine where the Access database (.mdb file) is installed. MDAC 2.5 or higher must also be installed.

Microsoft Exchange Server
  Agents must be running on the same machine where the Exchange Server software is installed. MDAC 2.5 or higher must also be installed.

Microsoft Index Server
  Agents must be running on the same machine where the Index Server software is installed. MDAC 2.5 or higher must also be installed.

Microsoft Site Server
  Agents must be running on the same machine where the Site Server software is installed. MDAC 2.5 or higher must also be installed.

Microsoft SQL Server
  Agents can be running on any machine where MDAC 2.5 or higher is installed (the SQL Server API permits remote access).

ODBC - Access
  Agents must be running on the same machine where the Access database (.mdb file) is installed. MDAC 2.5 or higher must also be installed.

ODBC - DB2
  Agents must be running on the same machine where the DB2 client, at a minimum, is installed (remote access is possible as long as the DB2 client is available).

ODBC - Oracle
  Agents must be running on the same machine where the DB2 client, at a minimum, is installed (remote access is possible as long as the DB2 client is available).

ODBC - SQL Server
  Agents must be running on the same machine where ODBC 3.0 is installed (access is completely remote).

Brokers

Extended Search brokers are intermediary components that sit between the requestors of service and the agents that actually perform the service against the back-end. They function as special-purpose resource coordinators designed to manage the multitude of searches generated from a single request, as caused by a category search, for example. Figure B-3 illustrates the functionality performed by an Extended Search broker.

Figure B-3 Extended Search broker (the broker validates and expands the incoming request using shared configuration memory, applies security and logging exits, fans the searches out to local and remote data sources, and caches the resulting hit lists for paging)

A broker typically performs the following tasks:

- Validates the request.
- Expands categories to obtain a list of the data sources available to the application and resolves the source addresses. (Label 1)
- Distributes queries to agents for efficient, parallel searching. (Label 2)
- Aggregates and optionally sorts the search results returned by the various agents into a single search result set. (Label 3)
- Caches search results for subsequent paging operations. (Label 4)
- Issues requests to agents to retrieve source documents for the user (in most cases, the Web browser uses the URL returned in the results list to retrieve the document).
- Honors timeouts and response options.

The degree of responsiveness can vary dramatically when a large set of back-end systems contributes to a single request; a simplified sketch of this fan-out, aggregate, and prune cycle follows.
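The following Java sketch makes the fan-out, timeout, and aggregation behavior concrete. It is illustrative only: the Agent and RankedHit types, the thread pool sizing, and the use of java.util.concurrent are assumptions made for brevity and do not reflect the actual Extended Search implementation, which talks to its agents asynchronously over its own protocol.

Example: Sketch of a broker-style fan-out with timeout and pruning

   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.Comparator;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.concurrent.TimeUnit;

   /** One hit as returned by an agent, already ranked by that agent. */
   class RankedHit {
       final String url;
       final double rank;
       RankedHit(String url, double rank) { this.url = url; this.rank = rank; }
   }

   /** Minimal stand-in for an agent fronting one data source. */
   interface Agent {
       List<RankedHit> search(String query, int maxHits) throws Exception;
   }

   /** Sketch of the broker's distribute, wait, aggregate, and prune cycle. */
   class BrokerSketch {
       private final ExecutorService pool = Executors.newFixedThreadPool(16);

       List<RankedHit> search(final String query, List<Agent> agents, final int maxHits,
               long timeoutMillis) throws InterruptedException {
           // 1. Distribute the query to every resolved source in parallel.
           List<Callable<List<RankedHit>>> tasks = new ArrayList<Callable<List<RankedHit>>>();
           for (final Agent agent : agents) {
               tasks.add(new Callable<List<RankedHit>>() {
                   public List<RankedHit> call() throws Exception {
                       return agent.search(query, maxHits);   // each agent sorts and prunes its own results
                   }
               });
           }
           // 2. Wait no longer than the caller's timeout; slow sources are simply skipped.
           List<Future<List<RankedHit>>> futures = pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);

           // 3. Aggregate whatever came back, re-sort by rank, and prune to maxHits.
           List<RankedHit> merged = new ArrayList<RankedHit>();
           for (Future<List<RankedHit>> f : futures) {
               try {
                   merged.addAll(f.get(0, TimeUnit.MILLISECONDS));
               } catch (Exception timedOutOrFailed) {
                   // Return whatever results have been compiled up to this point.
               }
           }
           Collections.sort(merged, new Comparator<RankedHit>() {
               public int compare(RankedHit a, RankedHit b) { return Double.compare(b.rank, a.rank); }
           });
           return merged.size() > maxHits ? merged.subList(0, maxHits) : merged;
       }
   }

The two behaviors worth noting are that each agent has already ranked and pruned its own result set before the broker ever sees it, and that anything not returned before the timeout expires is left out of the merged list rather than blocking the requestor.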
Some data management systems respond faster than others, and some not at all – possibly due to out of service conditions. To account for this situation, brokers were designed to communicate asynchronously with their agents. This design allows a broker to not be dedicated to any one particular back-end data source, and it enables the user to assign a timeout value to the request. When a timeout threshold has been reached, the broker returns whatever results have been compiled up until that point. Appendix B. Understanding the Lotus Extended Search architecture 213 Additional options let you control how the broker returns results. Two such options are to return the results when they are available or after they have been sorted. If you specify the “When available” option, the broker will return the results in the order that the sources respond to the query. This approach provides a fast way to see the results of your search, but there is no guarantee that the first results you see will be the most relevant results. If you specify the “Sorted” option, the Broker will collate all the results, sort them according to additional options you specify, and eliminate duplicate references before returning the results to you. This approach usually takes longer than obtaining results as they are available, but the results may be more relevant to your query. Multiple brokers To support performance and scalability, a given Extended Search domain can contain multiple brokers. This ability to establish a hierarchy of brokers, along with the ability to set up agents co-resident with the sources they support or to dedicate agents to particular sources or types of sources, provides Extended Search with endless flexibility with regard to changing and expanding environments. Under a multiple broker schema, sources get partitioned across all of the brokers, a design that prevents any one broker from being overwhelmed. In a single broker environment, a search that targets six dozen sources would result in 72 queries being sent to the remote machines and 72 sets of search results being returned to the broker. If each result set contains the maximum number of results, most of the data will be discarded when the broker consolidates and aggregates the data for the list being returned to the requestor (the broker prunes the results and keeps only the top items, up to the maximum number allowed by the search application). With multiple brokers, an entry broker sends a single message to brokers on remote machines. The remote brokers then split the message into multiple requests for the sources (fronted by agents) on their respective machines. Instead of all result sets being returned to one broker, each broker consolidates, aggregates, and prunes the results returned by its agents, and then returns just a single list – containing the top hits – to the entry broker. The entry broker only needs to create a final results set from its own local sources and the consolidated lists returned by the remote brokers. This design enhances overall performance (less bandwidth is needed for broker-to-broker communication as compared to that needed to communicate with hosts that lack brokers) and it allows new sources, regardless of location, to be easily integrated into an existing domain. 214 Patterns: Portal Search Custom Design Configuration database The broker obtains information about the resources it is to manage from the Extended Search configuration database. 
This database contains information about data sources and how they should be searched. It also stores network addresses, saved queries, saved search results, and data that was downloaded by a Web crawler. You can easily update information about your network topology, data sources, and search applications by using an intuitive Administration interface. This interface also provides the gateway through which you can run discovery (discussed below), view error messages and event data, schedule queries, and work with saved queries and search results.

Several wizards facilitate common configuration activities. The wizards enable you to easily export and import data between domains, design the format and content of search result sets, specify data source search and retrieval parameters, and configure mapped fields. Note that a simple refresh action will disseminate changes you make in the CDB throughout the Extended Search domain. The only time you need to restart the server is when you update configuration data for the server itself.

Discovery

To add data sources to your domain, Extended Search provides a collection of discoverers, programs that load the CDB with default information about a data source. Discovery greatly simplifies configuration by automatically populating field and parameter information for each new data source. Later, using the Administration interface, you can designate which fields you want to enable for search and retrieval operations. The discovery process is also able to ascertain whether or not a particular data source has been previously loaded into the configuration database; the discoverer skips already defined sources on subsequent invocations, even if the data source name has changed. Extended Search comes with a broad set of discoverers that enable you to quickly incorporate many of the industry's common data management systems. As with links and translators, if your data system is not contained in this standard set, you can develop a custom discoverer by using the Extended Search toolkit.

Monitoring

To help you collect statistics and fine-tune the system for performance, Extended Search includes a Monitor, a tool that enables you to observe server activity through a graphical user interface. The Monitor is packaged as a standalone C++ program and as a Java applet that you can launch from within the Administration interface. This feature enables you to make adjustments and refresh the system without having to restart the server. The Monitor can run independently of the broker, and can be started and stopped any number of times without affecting work being done by the Extended Search server. Because it can run remotely, you can quickly check the status of various servers from a location other than the host console.

Environment

Because Extended Search is designed to use existing software to search for information and retrieve data from wherever it exists throughout an organization, it must integrate well with the existing IT infrastructure. To this end, an Extended Search domain supports a mixed topology. Extended Search server components (brokers, agents, and so on) can reside on IBM AIX, Sun Solaris, Microsoft Windows 2000, and Microsoft Windows NT platforms, and you can mix the component topology as needed to satisfy the requirements of your operating environment (Figure B-4).
216 Patterns: Portal Search Custom Design Figure B-4 A typical Extended Search environment As shown in the preceding illustration, users can submit requests through a Web browser or a Lotus Notes client — interfaces that they are already familiar and comfortable with. This design allows Extended Search to provide a distributed search across many different data repositories through a single, efficient, and easy to use point of access. All user requests get sent to the Web server, which in turn forwards the request to the appropriate Extended Search broker. The broker, in turn, contacts the agents needed to carry out the request and search the various target sources. When access is through a Web browser, information about the search (what sources to search, how to search them, and how results should be returned) is determined by the HTML or JavaServer Pages that define the search application. Appendix B. Understanding the Lotus Extended Search architecture 217 When access is through Notes client software, information about the search is stored in a search application database, which can either exist on a Domino server or be replicated down to the user’s workstation. Note that Extended Search uses the Hypertext Transfer Protocol (HTTP) to invoke the appropriate servlet for processing requests. This approach has some advantages: It allows the search application to use an industry-standard protocol (HTTP). This enables the application to use many Web server-related features such as support for socks, proxies, and secure sockets layer (SSL) technology. It allows servlets to communicate with an Extended Search server that resides on a machine other than the Web server. This provides for added flexibility when resource capacity and performance are of concern. 218 Patterns: Portal Search Custom Design C Appendix C. Using the WebSphere Portal Search Engine The WebSphere Portal Search Engine was not utilized in the search scenario in this redbook. However, it is a powerful basic search utility that has been provided with even more capabilities with each release of WebSphere Portal. As a guideline for implementing this technology in your own solutions, this appendix describes the setup of the Portal Search Engine within a WebSphere Portal 4.12 environment. Details on the setup and usage of the updated Portal Search Engine in WebSphere Portal v5 can be found in the Portal v5 Infocenter: http://publib.boulder.ibm.com/pvc/wp/500/ent/en/InfoCenter/wps/admsrch.html Additional details on the Portal Search Engine in WebSphere Portal v4.21 can be found in the Portal v4 Infocenter: http://publib.boulder.ibm.com/pvc/wp/42/ext/en/InfoCenter/wps/admsrch.html © Copyright IBM Corp. 2004. All rights reserved. 219 How to set up Portal Search in WebSphere Portal Server Setting up Juru search or document search for your Portal would require: 1. Creating the Search page. 2. Building an index. 3. Setting up security. 4. Configuring the crawler.properties (optional). Creating the Search page You need to create a page that will contain the Document Search and Manage Search Index portlets. Let us create a sample search page: 1. Log on to the portal as the Administrator (wpsadmin). 2. First we need to create a copy of the Document Search portlet, which we can then use on our Search page. Select Portal Administration ->Portlets ->Manage Portlets Note: It is recommended to create another instance of the Juru Search portlet, because this portlet can be used to search on a single index. 3. 
From the list of portlets, select Juru Search and then click Copy. Figure C-1 Create Copy of Juru or Document Search Portlet 220 Patterns: Portal Search Custom Design 4. Provide a name for the new portlet instance; for example, My Juru Search and then click OK. 5. The new portlet is not activated by default. So, select it from the list of portlets and then click Activate/Deactivate. 6. Click Modify parameters. This option allows you to specify the search index. Specify the Index Location parameter, for example, in the case of UNIX, /var/PortalServer/indices/index1, or in the case of Windows, C:\temp \index1, depending upon the platform on which the Portal is installed. This is the name and location of the index that we will create later on. Now, click Save. 7. Select the Work with Pages option. Click Manage Places and Pages and then select Create place. 8. Provide a Place name and default locale title for the place; for example, Juru Search. Then, click OK. 9. From the list of Places you can manage, select Test and then click Manage pages. 10.Click Create page -> Create new 11.Provide a name for the page, select a Layout, and then click OK. 12.Select Edit Layout and Content, and select the Place as Test and the Page as Search. 13.Click Get portlets. Select either Show all portlets or Search for portlets using the keyword search. Click Go. 14.From the list of portlets returned, select My Document Search and Manage Search Index portlets by clicking the Add to list (+) button besides them. Then, click OK. 15.You can edit the layout of the Search page and then add the selected portlets to the page. Click Activate. Building a Juru Index The Manage Search Index portlet can be used to build and maintain indices of Web content that will be used by the search portlet. The search index stores key words and terms and maps them to their source documents, enabling fast processing of requests from the search portlet. During the build process, documents are retrieved for indexing through a Web crawler (robot). Searchable resources can be stored on the local portal server or on remote sites. Users can search HTML and text documents: 1. Log onto the portal as the Administrator (wpsadmin) and then navigate to the search page that we created, for example, Test -> Search. Appendix C. Using the WebSphere Portal Search Engine 221 2. On the Manage Search Index Portlet, click the Configure search index option. 3. Specify the following values for configuring our index: – Location of the index as /var/PortalServer/indices/index1 or C:\Temp\SampleIndex – Task for configuring the index as New Index – Starting URL as http://w3.itso.ibm.com/ or any URL that would be the base URL for your index – The option to enable CJK language support enables support for Chinese, Japanese, and Korean languages. We do not require this option. – Document types to be indexed as both HTML and text. – Levels of linked documents should be at least 1. – Number of linked documents to index can be retained as 100. 4. Click OK to save the configuration and then click Done. 5. Now click the Manage search index option on the Manage Search Index Portlet. 6. From the list of indices, select the index that we just configured (C:\Temp\SampleIndex) and then click Begin index update. Once the index has been built, if you re-visit the Manage search index option (or click Refresh on the browser) you will see the statistics for Last update completed at and Number of active documents updated. 7. Click Done. 
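As a side note, the earlier statement that the search index "stores key words and terms and maps them to their source documents" describes a classic inverted index. The toy Java sketch below shows the idea only; it is not the Juru index format (which also stores ranking information and is populated by the crawler configured in the steps above), and the class name and tokenization rules are invented for illustration.

Example: Toy inverted index illustrating term-to-document lookup

   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;
   import java.util.StringTokenizer;

   /**
    * Toy inverted index: each term maps to the set of document URLs that
    * contain it. A real search index also stores ranking data, but the
    * lookup idea is the same.
    */
   class ToyInvertedIndex {
       private final Map index = new HashMap();   // term -> Set of URLs

       void addDocument(String url, String text) {
           StringTokenizer terms = new StringTokenizer(text.toLowerCase(), " \t\n\r.,;:!?\"'()<>");
           while (terms.hasMoreTokens()) {
               String term = terms.nextToken();
               Set urls = (Set) index.get(term);
               if (urls == null) {
                   urls = new HashSet();
                   index.put(term, urls);
               }
               urls.add(url);
           }
       }

       /** Documents containing every term of the query (simple AND semantics). */
       Set search(String query) {
           Set result = null;
           StringTokenizer terms = new StringTokenizer(query.toLowerCase());
           while (terms.hasMoreTokens()) {
               Set urls = (Set) index.get(terms.nextToken());
               if (urls == null) return new HashSet();        // a missing term means no matches
               if (result == null) result = new HashSet(urls);
               else result.retainAll(urls);
           }
           return result == null ? new HashSet() : result;
       }
   }

Because a lookup touches only the entries for the query terms rather than scanning every crawled document, searches through the portlet stay fast even when the index covers a large site.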
Setting up permissions

Two basic tasks must be completed before the Search feature can be made available to a portal user: portal users should be given View access to the Search page, and the Manage Search Index portlet should not be accessible to users other than the Administrator. Here are the steps to accomplish these objectives for our Search page:

1. Log on to the portal as the Administrator (wpsadmin) and then click Portal Administration -> Security.
2. For the Select a group or user to assign permissions field, select Special groups -> All authenticated users.
3. Select pages for the Select the objects for the permissions field. Click Go.
4. Provide View permissions for the Test place and Search page. Click Save.
5. Now select portlets for the Select the objects for the permissions field. Provide search as the keyword for the Search On -> Name contains field and then click Go.
6. Provide View access for the My Document Search portlet and None for the other portlets.
7. You can now log out and then log on to the portal as an ordinary user.

Configuring the crawler

The index build process is optimized for crawling inside an intranet. If you need the crawler to fetch documents on the other side of a firewall, you need to update the crawler.properties file (located in the index directory). You can set either the name and port of a proxy server or a socks server. For example:

Example: Proxy settings for the crawler

#The name of the socks server to be used <server name>:<port number>
SocksServer=socks.yourco.domain\:1080
#The name of the proxy server to be used <server name>:<port number>
ProxyServer=proxy.yourco.domain\:80

You can also specify additional URLs (a maximum of nine) to be crawled into the same index.

Example: Additional sites to be indexed

#OtherRoot1=http\://www.second.site
#OtherRoot2=http\://www.third.site
...
#OtherRoot9=http\://www.last.site

Related publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.

IBM Redbooks

For information on ordering these publications, see "How to get IBM Redbooks" on page 227.

Patterns: A Portal composite pattern using WebSphere Portal V4.1.2, SG24-6869
Patterns: A Portal composite pattern using WebSphere Portal V5, SG24-6087
Access Integration Pattern Using WebSphere Portal Server, SG24-6267
Applying Pattern Approaches, SG24-6805
Patterns: Self-Service: Connecting to the Enterprise, SG24-6572
IBM WebSphere Everyplace Server Service Provider and Enable Offerings: Enterprise Wireless Applications, SG24-6519
IBM WebSphere Everyplace Server: A Guide for Architects and Systems Integrators, SG24-6189
Applying the Patterns for e-business to Domino and WebSphere Scenarios, SG24-6255
Mobile Applications with IBM WebSphere Everyplace Access Design and Development, SG24-6259
Self-Service Patterns using WebSphere Application V4.0, SG24-6175
Self-Service Applications Using IBM WebSphere V4.0 and MQSeries Integrator, SG24-6160
Patterns for the Edge of Network, SG24-6822
Web Services Wizardry with WebSphere Studio Application Developer, SG24-6292
IBM WebSphere V4.0 Advanced Edition Handbook, SG24-6176
Java Connectors for CICS: Featuring the J2EE Connector Architecture, SG24-6401
MQSeries Programming Patterns, SG24-6506
225 WebSphere Version 4 Application Development Handbook, SG24-6134 IBM WebSphere V4.0 Advanced Edition Handbook, SG24-6176 IBM WebSphere Portal Developers Handbook, SG24-6897 IBM WebSphere Portal V4.1 Handbook, Volume 1, SG24-6883 IBM WebSphere Portal V4.1 Handbook, Volume 1, SG24-6920 IBM WebSphere Portal V4.1 Handbook, Volume 3, SG24-6921 WebSphere Portal 4.12 Collaboration Services, REDP0319 IBM WebSphere Portal V4.1.2 in a Linux Environment, REDP0310 WebSphere Portal V4.1 AIX 5L Installation, REDP3594 WebSphere Portal V4.1 Windows 2000 Installation, REDP3593 Other resources These publications are also relevant as further information sources: Patterns for e-business: A Strategy for Reuse by Jonathan Adams, Srinivas Koushik, Guru Vasudeva, and George Galambos, ISBN 1-931182-02-7 Flanagan, David, JavaScript: The Definitive Guide, Third Edition, O'Reilly & Associates, Inc., 1998, ISBN 0-596-00048-0 Maruyama, Hiroshi, Kent Tamura and Naohiko Uramoto, XML and Java: Developing Web Applications, Addison-Wesley 1999, ISBN 0-201-77004-0 Flanagan, David, Jim Farley, William Crawford and Kris Magnusson, Java Enterprise in a Nutshell, O’Reilly & Associates, Inc., 1999, ISBN 0-596-00152-5 Subrahmanyam Allamaraju et al, Professional Java Server Programming J2EE Edition, Wrox Press, 2001, ISBN 1-861004-65-6 Referenced Web sites These Web sites are also relevant as further information sources: Patterns for e-business Web site: http://www.ibm.com/developerWorks/patterns/ IBM Redbooks internal Web site: http://w3.itso.ibm.com 226 Patterns: Portal Search Custom Design How to get IBM Redbooks You can order hardcopy Redbooks, as well as view, download, or search for Redbooks at the following Web site: ibm.com/redbooks You can also download additional materials (code samples or diskette/CD-ROM images) from that site. IBM Redbooks collections Redbooks are also available on CD-ROMs. Click the CD-ROMs button on the Redbooks Web site for information about all the CD-ROMs offered, as well as updates and formats. Related publications 227 228 Patterns: Portal Search Custom Design Index Symbols .NET 146, 148 Numerics 80/20 situation 4 A Access Integration pattern 25 aggregated content 138 APPLET tag 128 Application Integration 44–45 runtime patterns 76 Application Integration pattern 25, 28 Application patterns 5, 11 Application Server node 69 Architect 23 architecture 137 asynchronous 70 authentication 68, 144 authorization 68, 146 B back-end applications 69 back-end integration 28 Benefits 31 Best practices 5 Web services 146 Best-practices 16 Business drivers 20 Business patterns 4, 7, 23, 25 C Cascading Style Sheets 125 Categorization 117 certificates 68 Chrisco Books 153 cHTML 129 Collaboration 26, 72, 138, 145 Collaboration business pattern 25–26 Collaboration node 70 Collaboration pattern 25 © Copyright IBM Corp. 2004. All rights reserved. 
collaboration services 145 Command-Manager design pattern 141 common data model 115 Community 70 Composite patterns 5, 9, 23, 25 components 23 Content 143 Content management 72, 143 Content Management node 71 controller 139 cross-selling 71 CSS See Cascading Style Sheets Custom designs what are 36 D data aggregation 18, 73 Data integration 45 data sources 69 Database Server node 71 DB2 Information Integrator for Content 100–101 demilitarized zone (DMZ) 69 Developer 24 DHTML 129 Directory and Security Services node 68 documents 71 Domain Name Server node 68 Dynamic HTML DHTML 126 E ECMA-262 126 ECMAScript 126 EIP see DB2 Information Integrator for Content EMBED tag 128 Enterprise Information Portal see DB2 Information Integrator for Content Extended Enterprise business pattern 25, 27 Extensibility 22, 95 Extensible Markup Language 130 229 F see WebSphere Portal Search Engine Federated Repository 55 Federation 41, 44–45, 55, 65, 105 application pattern 56 product mappings 107 runtime pattern 83 with external data 84 federation 74 Field mapping 115 Firewalls 69 functions 94 G Guidelines 5, 16, 137, 145 H HTML 70, 126 Validator tools 125 HTTP 69 HTTP tunneling 128 HTTP/HTTPS 97 I IBM Global Services 23 Images 71 indexed 71 Information Aggregation 44, 57 runtime patterns 85 Information Aggregation business pattern 25, 27 Integration patterns 5, 7, 23, 25 Internet Service Provider 68 IT drivers 21 J J2EE 129 Java applets 127 Disadvantages 128 Java programmer 24 Java Runtime Environment 128 JavaBeans 130 JavaScript 126 JavaServer Pages 129 JDBC 97 JRE See Java Runtime Environment JScript 126 JSP 69, 129 Juru 230 Patterns: Portal Search Custom Design L Layered design 147 LDAP directory 137 LDAP/LDAPS 97 leveraging legacy investments 37 Limitations 31 Loose coupling 147 Lotus Discovery Server 102 Lotus Domino 101 Lotus Extended Search 98 agents 209 architecture 207 brokers 212 configuration database 215 links 209 M Maintainability 22, 95 mobile 72 model 139 Model-View-Controller design 139 Model-View-Controller design pattern 139 Multi-client device 72 Multi-Tier Design 136 MVC structure 142 N network protocols 97 P Patterns for e-business 3 Application patterns 5, 11 Best practices 5, 16 Business patterns 4, 7 Composite patterns 5, 9 Guidelines 5, 16 Integration patterns 5, 7 Product mappings 5, 15 Runtime patterns 5, 12 Web site 6 performance considerations 119 personal computing device 68 Personal Digital Assistant (PDA) 134 personal digital assistant (PDA) 68 Personalization 72 Personalization Server (Rules Engine) node 70 Pervasive User node 72 platforms 96 Population 47, 65 Data Cleansing 45–46 application pattern 49 Index Population 44–45, 50, 52, 105 application pattern 52 product mappings 106 runtime pattern 76 with Data Cleansing 82 with external data 80 Multi Step 45 application pattern 48 applied to indexing 78 Multi-step 41, 44, 46, 48 Single Step 44–46 application pattern 47 Synchronization 45, 54 application pattern 54 population 74 Population Crawl and Discovery 50 Population Summarization 50 Portal application design 133 Portal applications 133 Portal characteristics 28 Portal composite pattern 18, 24, 27, 30–31, 137, 141 Portal composite runtime pattern 69, 72 portal implementation 23, 68, 72, 137, 143 Portal search the need for 36 Portal Search custom design 35 a scenario 153 application patterns 41 business drivers 37 compared to composite pattern 42 IT drivers 37 product mappings 96 protocol mappings 98 Runtime pattern 73 runtime pattern 94 portal system 143 Portlet API 135 
Portlets 142 Presentation Server 72 Presentation Server node 70 Process integration 45 Product Descriptions 98 Product mappings 5, 15 products 94 Project Manager 23 Protocol and Domain Firewall node 69 Public Key Infrastructure node 68 Q Query syntax 114 R Redbooks Web site 227 Contact us xiii Replication 54 requirements 30 resource connection pooling 69 resources 71 Reuse 22 RM/IIOP 97 rules 70 Runtime patterns 5, 12, 28, 67, 72, 94–95 S Sales 23 saved searches 123 Scalability 22, 95 SCRIPT tag 128 search 74 Search & Indexing 72 security 71, 134, 145–146 Security concerns 118 Self-Service business pattern 25–26 Servlets 128 Signed applet 128 Single Sign-On 31, 70, 138 Single Sign-On (SSO) 144 Single-Tier Design 136 SOAP 140 SOCKS proxy 146 SSL protocol 68 Summarization 116 Sun ONE 146 Swing 127 synchronous 70, 72 systems 18, 138 U User / Internal User node 68 Index 231 User Information Access 57 application pattern 58–59 immediate update 59 User Management 145 User Search & Discovery application pattern 61 runtime pattern 86 search adapter variation 87 search service variation 88 with external users and data 89 User Search and Discovery 41, 44, 61, 64, 105 product mappings 107 V Validator tools HTML 125 VBScript 126 Versioning 144 View 139 VPN 68 W Web container 129 Web Server Redirector node 69 Web Services 136, 140 Web services Best practices 146 WebSphere Application Server 103 WebSphere Content Publisher 71 WebSphere Porta 103 WebSphere Portal Search Engine 104 usage hints and tips 219 WebSphere Portal Server 135 Wireless Gateway node 72 Wireless Markup Language 129 Workflow 26, 72 workflow 138 X XML 129–130 232 Patterns: Portal Search Custom Design Patterns: Portal Search Custom Design (0.5” spine) 0.475”<->0.875” 250 <-> 459 pages Back cover ® Patterns: Portal Search Custom Design Applying the Information Aggregation patterns to portal search solutions Hints/tips for using IBM search technologies A portal search scenario The Patterns for e-business are a group of proven, reusable assets that can speed the process of developing applications. The Portal Search Customer Design builds off the Portal Composite Pattern, combining Business and Integration patterns to help implement a portal search solution. Part 1 of this IBM Redbook provides introductory material around the IBM Patterns for e-business, and the Portal Composite Pattern. Part 2 guides you through the process of choosing the Business and Integration patterns of the custom design, and then drills down to the Application and Runtime patterns, and Product mappings. Part 3 provides a set of guidelines for implementing and building a portal search solution, including a discussion of search technology selection criteria, as well as application design and development. Part 4 demonstrates how to implement a portal search solution via a technical scenario. This technical scenario uses the WebSphere Portal Extend offering, combined with Lotus Extended Search. INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment. For more information: ibm.com/redbooks SG24-6881-00 ISBN 0738498289