
The Case for Lucene/Solr: A Manager’s Guide to Real World Open Source Search Applications
By Lucid Imagination

Abstract

In today’s information-driven environment, search solves critical problems by slashing the time and effort that separate end users from the data they value. Search spans the range of business models and use cases, from driving direct customer sales to analytics and business intelligence, employee productivity, and reduced administrative overhead. Making the best use of search requires two perspectives: a look at the business requirements for a search application, and a view to the new business opportunities created by using search to leverage the organization’s content resources.

Thousands of organizations across different sectors and business models have harnessed Apache Lucene/Solr to search their rapidly growing and diversifying content resources. Underlying this broad adoption is the extraordinary power, scalability, and versatility of open source search technologies.

This paper provides an overview of both the requirements and the opportunities for search applications. It then explores how real world organizations are successfully using Lucene/Solr search applications to meet those opportunities, presenting how the technology is used for specific business models and use cases across industries. In addition, it offers a baseline for setting search requirements that managers and architects can use to adopt Lucene/Solr and adapt this open source search technology to the unique needs of their business.

© 2010, Lucid Imagination
The Case for Lucene/Solr: Real World Search Applications
A Lucid Imagination White Paper • January 2010
Table of Contents

Introduction
Understanding Search Opportunities and Requirements
    What Data and Documents Are You Searching?
    Who Needs the Results and Why?
    Where Is Search Integrated with IT Infrastructure?
    How Is the Search Interface Presented to the User?
The Real World: Applications and Case Studies
    Yellow Pages, Local Search, and Searching Classifieds
    Media
    E-commerce
    Job and Career Sites
    Libraries, Archives, and Museums (LAMs) Search
    Social Media Search
    Enterprise (Intranet) Search
Business Use Case Matrix
Appendix: Lucene/Solr Features and Benefits
Introduction
As fast as companies, communities, and consumers produce data (about each other, products, opinions, research, and everything else imaginable), they need faster, more versatile search capabilities to find the information that creates opportunities for competitive advantage. In today’s information-driven environment, search addresses the critical problems created by the explosive growth of content by slashing the time and effort users expend in finding data they value. Search spans the range of business models and use cases: from driving direct customer sales, to analytics and business intelligence, employee productivity, and reduced administrative overhead.

Apache Lucene/Solr¹ open source search technology has been implemented across the broadest range of applications and business models, likely in ways that can fit the needs of your organization. In successful operation today at thousands of enterprises, Lucene/Solr technology scales from tens of thousands to billions of documents; searches structured data, unstructured data, and combinations of the two; reaches data inside and outside the firewall; and ranges in use from a simple website search box to sophisticated faceted navigation. It addresses equally diverse business processes and mission-critical applications. Across the spectrum, Lucene/Solr helps users find, make sense of, and act upon information quickly and efficiently.

In this white paper, we’ll review real-world case studies of Lucene/Solr functionality across business sectors to demonstrate its versatility and varied applicability. The diversity of examples provides strong evidence of Lucene/Solr’s flexibility and power as a search technology. The examples also attest to the innovation and transparency inherent in the open source development model. Our focus is on familiarizing business managers and application owners with existing Lucene/Solr applications; the substantial technical advantages to developers are covered elsewhere.

¹ Lucene and Solr are complementary technologies that offer very similar underlying capabilities; Solr is the Lucene Search Server. Since Lucene serves as the core of Solr’s search capabilities, this paper refers to the two as Lucene/Solr. For more information, see the Appendix.
We’ll first survey the key requirements and business use cases of search and then look at where they are built into search applications. Our objective is to provide business managers and application owners with a broad perspective on how Lucene/Solr search technology is used to build solutions to compelling business problems. In the Appendix, we provide an overview of Lucene/Solr’s key features and benefits, with a basic outline of the capabilities offered to meet the broadest range of business needs.

Understanding Search Opportunities and Requirements
Search technology has come a long way from its roots in matching keywords against documents and returning undifferentiated results. Search today empowers users by delivering actionable information quickly and efficiently, across multiple, diverse sources of data. The business use cases range from executing mission-critical commercial transactions (e.g., e-commerce sites) to unlocking employee and end-user productivity in the search for a single relevant document (e.g., enterprise search).

Given the breadth of the problem domain, it’s useful to look at search and ask two fundamental questions: “How can it solve my business problems?” and “What new business opportunities can search open up?” In considering how search technology solves business problems, it is useful to start by spelling out the requirements you’ll need to consider for your search application. At the same time, be sure to look more broadly at the capabilities that Lucene/Solr offers, as they can help open up new frontiers for incorporating search and leveraging more value from data repositories.

Starting with some basic questions (what, who, how, and where), you can clarify the high-level business requirements specific to your needs, which in turn allows you to make the best decisions for your search application. The process of looking at the fundamentals also raises new questions about how and where the search technology offered by Lucene and Solr can create new business opportunities.

Let’s look at four fundamental questions you should address in understanding search opportunities and requirements:

• What data and documents are you searching?
• Who needs the results and why?
• Where is search integrated with IT infrastructure?
• How is the search interface presented to the user?
What Data and Documents Are You Searching?
Business today is driven more than ever by end users’ creation and consumption of real-time information. A key differentiating capability of search technology is ingesting a broad range of content types and processing large collections of diverse data in real time in order to deliver actionable information. Two aspects to consider:

• Types of Content
  Content comes in multiple formats: HTML pages, XML files, PDFs, images, PowerPoint presentations, Excel spreadsheets, Word documents, log files, multimedia content, and more. Content resides in various repositories, including databases, file servers, content management systems, archiving systems, collaboration applications, and employee desktops and laptops. Search technology must be able to locate, organize, and aggregate data whatever its form or location.

• Frequency of Updating Content
  Organizations update content at varying intervals, driven by differing business processes and models: social media or news applications have real-time content needs, whereas an e-commerce application might re-index in batches in response to new inventory, and a research institution might add to its collection less often still. Search applications need to adapt to these differences in how often content changes.

Who Needs the Results and Why?
Business search puts a high priority on end user experience and on results tuned to the unique needs of each user. After all, the human dimension (the usefulness of results and the efficacy of interaction) is the acid test of a search application.

Internet search applications like Google, Yahoo, and Bing are now common and mature. They have raised user expectations about key qualities of the search experience, but they solve a very different problem. While Internet searches can produce millions of results in milliseconds, they rely on measures like website popularity or URLs and domain names, which are neither relevant nor generally applicable to purpose-built business applications. What’s more, they generalize relevancy for a global population of all Internet users, without being tied to business rules, business process logic, or the opportunity cost of improved precision for a specific set of data or search users.

Business search applications cannot rely on such coarse, brute-force approaches to tune their results. They need far more control and precision. They have to deliver highly useful results while matching, if not exceeding, the level of user experience that people have come to expect from their daily interactions with commercial search engines. Key points of consideration from a business perspective are:
• Relevance
  Relevance depends entirely on the goals of the search application’s users. The application must have the mechanisms to recognize the subjective needs of users and tune results accordingly. It must also provide easy ways to narrow search criteria without requiring users to come up with perfect query terms. Flexibility for drilling deeper makes results richer and more valuable. Mechanisms to apply filters, proximity values, and sorting parameters to narrow search scope can also lead to a richer set of more useful results, with less time and effort.

• Cost of Relevance
  As business goals are driven by revenue opportunities and cost savings, it is critical to tie relevance to the economics of the business. For example, a public-facing retail site should focus on matching merchandise to searches, site stickiness, and customer loyalty. It requires search technology that streamlines and simplifies the shopping experience, with relevant results directly contributing to sales revenue. For knowledge workers, internal search applications should make employees more productive by reducing the time and effort needed to find the documents they need to do their jobs. Multiple studies show that information workers can spend 20–30% of their time searching for information.

• Precision Ranking
  Accurate results, sortable by attributes such as relevance, date, field, or any other document property, make the search process better. End users generally abandon a search before tackling the fine points of Boolean logic or scrolling for a result buried too far down.

• Query Response Speed
  Today, 5–7 seconds is the typical threshold for end-user patience. Too long a wait for search results frustrates users and causes them to abandon pages. Fast, relevant results must not be held back by search technology that is hamstrung by data influx or query overload. Query response time should also work hand-in-hand with the refinement of multiple search attributes, so that increasingly complex queries do not exact a performance penalty.
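The filters and sorting parameters mentioned under Relevance and Precision Ranking can be made concrete with a small sketch. Solr accepts searches as HTTP requests whose standard parameters include `q` (the user's query), `fq` (a filter query that narrows results), `sort`, `rows`, and `wt` (response format). The Python below only assembles such a request URL; the host, port, path, and the field names `brand`, `in_stock`, and `price` are assumptions made up for illustration, not details taken from any deployment described in this paper.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and path are placeholders for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(user_query, filters=None, sort=None, rows=10):
    """Build a Solr select URL from a free-text query plus optional narrowing."""
    params = [("q", user_query), ("rows", rows), ("wt", "json")]
    # Each fq (filter query) restricts the result set without affecting scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    if sort:
        params.append(("sort", sort))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url(
    "running shoes",
    filters={"brand": "Acme", "in_stock": "true"},  # hypothetical fields
    sort="price asc",
)
print(url)
```

The point of the sketch is that narrowing and ordering are expressed as extra request parameters, so users can refine results without having to craft a more elaborate query string themselves.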
Where Is Search Integrated with IT Infrastructure?
Useful, valuable search technology rarely exists in isolation. Searched data is transformed into actionable information when it is integrated with the organization’s information infrastructure, from business processes to business intelligence to content management systems. A robust search technology must be customizable to integrate seamlessly with existing systems.

• Application Integration
  A key requirement for a search application is its extensibility for integration with existing infrastructure and applications such as content management systems, databases, and the full range of business processes and applications. It should have interfaces that support ingestion of data as well as delivery of results in readily consumable formats, because in many cases results are consumed by other applications, not by a human.

• Scalability
  We can assume that data will change and grow, so scalability is a key factor for a search application. Applications should grow to address future needs without penalties for the breadth of data or for the count of documents indexed. The search application should be able to grow with the requirements of the organization, without needing large additional investments in hardware to match the pace of growth. Proprietary search vendors often charge for search by the number of documents indexed. In a world where constantly expanding content is the norm, such costs can be a real and substantial drag on the cost of ownership for search applications, often resulting in a negative return.

• Security
  Every organization has its own security requirements and access controls. Search technologies need to comply with the security policies of the enterprise, controlling results that have restricted access. The search technology should also be able to make use of document-level security from other sources.

How Is the Search Interface Presented to the User?
The user interface is where search delivers on findability and presents actionable results. The search application is only as good as the convenience of submitting queries, reviewing and refining results, and finding information. Key aspects to consider:

• Navigation
  Users benefit from guidance that makes their queries more productive. Techniques such as faceted search with result clustering, advance hinting (“did you mean”), “more like this,” and drop down menus for setting search scope help users achieve desired results faster, making a search application both user- and information-friendly. It is also important to allow users to draw associative connections between results, using the technology to uncover relationships and discover more about what they were seeking than they knew at the outset.

  The Netflix search application is powered by Solr; it adds a fuzzy dimension to search, with auto-completion of movie names, correction of misspelled actor names, and suggestions of the titles closest to the query. As a result, 85% of users have found the movie they were looking for ranked at the #1 spot in the results.

• Discovery
  Search application functionality should extend beyond the generic presentation of a result list of documents that contain a keyword. Highlighting keywords in search results, expanding searches with synonyms and spell checking, and offering users ways to learn a bit more about documents in the results without having to load them are great ways to significantly improve usability.

• Intuitive Intelligence
  Search applications must go beyond keyword search to help users retrieve accurate information even when they are not sure of the best keywords. Additionally, they should reduce misinterpretations where homonyms, spelling errors, and ambiguous keywords are involved (e.g., is “apple” a fruit or a computer company?).
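Two of the interface techniques listed above, faceted navigation and "did you mean" hinting, can be illustrated with a toy sketch. The Python below uses only the standard library and is a conceptual illustration under stated assumptions: the catalog records and field names are invented, and the string-similarity matching from `difflib` stands in for whatever spell-correction a real search engine applies. It does not show how Solr or Netflix implement these features.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical catalog records; the field names are made up for illustration.
CATALOG = [
    {"title": "the matrix", "genre": "sci-fi"},
    {"title": "the mask", "genre": "comedy"},
    {"title": "matilda", "genre": "comedy"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of a field (one facet)."""
    return Counter(doc[field] for doc in docs if field in doc)


def did_you_mean(query, docs):
    """Return the known title closest to the query, or None if nothing is close."""
    titles = [doc["title"] for doc in docs]
    matches = get_close_matches(query.lower(), titles, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(facet_counts(CATALOG, "genre"))
print(did_you_mean("the matrx", CATALOG))
```

Facet counts are what let an interface show "comedy (2)" next to a filter link, so users see how many results each refinement would leave before they click. The suggestion function shows the basic idea behind recovering from a misspelled query by comparing it against known values.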
The Real World: Applications and Case Studies
With an understanding of the fundamentals of search business applications in hand, it is helpful to gain additional context on business usage through a survey of organizations that have successfully used Lucene/Solr for powerful search applications. All of these cases were built on the capability of Lucene/Solr to provide innovative, high-performance, cross-platform, feature-rich search technology suitable for nearly every application.

By powering diverse search applications for thousands of organizations such as AT&T, Zappos, McClatchy, Smithsonian, MTV Networks, LinkedIn, MySpace, Comcast, Monster, Netflix, and many more, Lucene/Solr has provided mission-critical capability that turns search into a robust competitive advantage. For these organizations, Lucene/Solr solutions regularly index and search hundreds of millions of documents with subsecond response time, unencumbered by costly licensing or vendor lock-in. Together they represent a compelling argument for the broad applicability of Lucene/Solr across the full range of business opportunities and search needs.

Business use case studies we’ll review include:

• Yellow Pages, Local Search, and Searching Classifieds
• Media
• E-commerce
• Job and Career Sites
• Libraries, Archives, and Museums (LAMs) Search
• Social Media Search
• Enterprise (Intranet) Search
Yellow Pages, Local Search, and Searching Classifieds

In the business of online local search, geographic (location-based) relevance generates competitive advantage. Online directories need to provide a rich, interactive search experience to users to increase site views and stickiness, which in turn translates into increased advertising revenue. Simplified location-based search, intuitive faceted query response, and data mashups are a few of the features that define search functionality for an online directory.

Lucene/Solr solutions offer accurate search results, factoring in location, users’ reviews, and ratings, alongside paid advertising. By taking advantage of Solr’s open source model, with search algorithms that are completely transparent, companies can invest in configuring their search solutions to match their business logic, rather than trying to infer, or pay for exposure to, proprietary back-end logic.

Internet Yellow Pages and local online search is forecast to grow to $27.8 billion in 2011. (The Kelsey Report¹)

Requirements
• Intelligent results going beyond keyword search
• Deeper, faceted navigation
• Seamless integration with latest Web 2.0 tools
• Lower IT-related costs
• Geocentric user experience
• Search numeric values

Solr Solution
• Customizable search index which can be tuned transparently to account for key findability drivers
• Drop down filters for narrowing or widening the scope of search
• Seamless integration with existing technologies
• Native numeric encoding and search capabilities
• Reduced server footprint for lower TCO than most commercial vendors

Success Stories
• YP.com, a division of AT&T Interactive
• Zvents.com, local event search service
• Yelp.com, the community local search site

¹ The Kelsey Group’s Global Print Yellow Pages, Internet Yellow Pages and Local Search Five Year Outlook
Case Study 1: yp.com by AT&T Interactive

AT&T Interactive is an online and mobile search and advertising company. Their leading-edge portal, yp.com, an online business listing and advertising site, was originally implemented with a commercial proprietary search application. It faced issues of scalability, vendor lock-in, and performance. With help from Lucid Imagination, AT&T successfully migrated to a Solr-based search solution that leveraged the flexibility of open source without compromising features and functionality. And they did so with a much smaller budget.

Business Needs
• Addressing the need to factor in location to support geographic search, and include relevant comments
• Striking a balance between organic search and advertised content
• Indexing highly unstructured content such as user comments
• Increasing relevancy of results and boosting paid search results for preferential placement of advertisers
• Linguistic support to enable search experience, such as spellchecking, synonyms, find-similar, etc.
• Integrating with latest Web 2.0 tools
• Reducing server footprint

The Solr Solution
• Context-specific relevancy, geographic proximity, ad placement, and user comments
• Faceting, drop down filters to narrow/widen the scope of search
• Functional support for creating new features
• Spell-correction, and location-optimized search results to show users businesses nearest to them first
• Seamless integration with many Web 2.0 tools to create innovative features and mashups
• Lowers TCO by reducing the number of search servers from 120 to two dozen servers
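The idea of location-optimized results, showing users the businesses nearest to them first, can be illustrated with a short distance-sorting sketch. The Python below computes great-circle distances with the haversine formula and orders listings by distance from the searcher. This is a generic illustration only: the listings and coordinates are hypothetical, and it is not a description of how yp.com or Solr actually ranks local results, where distance would normally be combined with text relevance, ratings, and advertising rules.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_first(user_lat, user_lon, listings):
    """Sort listings by distance from the user, closest first."""
    return sorted(
        listings,
        key=lambda b: haversine_km(user_lat, user_lon, b["lat"], b["lon"]),
    )


# Hypothetical listings; names and coordinates are illustrative only.
LISTINGS = [
    {"name": "Los Angeles Plumbing", "lat": 34.0522, "lon": -118.2437},
    {"name": "Oakland Plumbing", "lat": 37.8044, "lon": -122.2712},
    {"name": "San Jose Plumbing", "lat": 37.3382, "lon": -121.8863},
]

# A user searching from San Francisco.
for listing in nearest_first(37.7749, -122.4194, LISTINGS):
    print(listing["name"])
```

In a production directory the distance would typically be one input to the ranking rather than the only sort key, which is where the transparent, tunable relevance described above comes in.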
Media
Brand reinforcement, premium content, and easy accessibility are the main business motivators for online media and publishing companies. Relevant information increases time on the site and encourages users to explore related content, boosting subscription rates and site views. These translate into a virtuous cycle of additional revenue generation. Given that content is the business, the need for a robust search application ties directly to competitive advantage.

Lucene/Solr provides a customized, function-rich solution for the media and publishing industry. It addresses the dynamic challenges of content diversity, content freshness, and content acquisition, and gives companies a platform on which to build a world-class, innovative search experience to differentiate themselves in a highly competitive marketplace.

“Solr has done wonders for us. It is easy to understand and deploy, and has reduced our costs drastically.”
Doug Steigerwald, McClatchy Interactive

Requirements
• Real-time indexing of petabytes of structured and unstructured data
• Deeper search capability
• Improved query response time
• Reduced infrastructure and customization costs

Solr Solution
• Reverse indexing
• Intelligent, faceted search to enable contextual and linguistic relevance
• Easy configuration for parsing structured and unstructured data
• Easy and seamless installation for lower TCO
• Customization with open source code

Success Stories
• McClatchy Newspapers
• Netflix
• Comcast Interactive
• MTV Networks, a division of Viacom
• The Motley Fool, fool.com
• Fanfeedr.com, personalized sports aggregator
Case Study 2: McClatchy, Leading Newspaper Publisher

The third largest newspaper publisher in the United States, McClatchy Company owns 30 daily newspapers in 29 markets across the country. To win online, McClatchy knew it had to have a robust search solution to empower the McClatchy audience with the information they wanted and to secure loyalty from readers and sponsorships from advertisers. Working with Lucid Imagination, McClatchy migrated from proprietary search software to open source and chose Solr for its high performance, comprehensive capabilities, and superior value.

Requirements
• Proliferating content and data sources (text, video, audio, images), with real-time streaming
• Empowering end users with ease of use
• Supporting peak traffic and popular search spikes with consistent performance
• Providing scalability for a database growing by orders of magnitude annually
• Providing flexibility to support customization
• Controlling IT costs while exceeding performance benchmarks of competition

The Lucene/Solr Solution
• Deeper content by indexing both structured and unstructured data in real time, effortlessly
• Indexes millions of documents, with search results delivered in milliseconds
• User-friendly navigation with drop down filters, faceted navigation, linguistic corrections, etc.
• Excellent performance, even in peak hours, by load-balancing search requests across servers
• Scalability without impact on performance
• High degree of customization, since it’s open source
• Integrates with existing IT infrastructure and eliminates associated license fees to cut costs
• 8-fold reduction in server footprint
E-commerce
E-commerce businesses must provide a compelling shopping experience in order to maintain brand equity and thrive in a highly competitive market landscape. By reducing the time and effort customers need to navigate available merchandise and find what they want, superior search contributes directly to a satisfying buying experience. Search then translates directly into higher revenues and customer loyalty. Instant results, intuitive organization, advanced faceting for easy browsing, results synchronized with images, and integration with user ratings are among the must-have features of an e-commerce search application.

Lucene/Solr gives companies the ability to build their sites around the concept of “searchendizing” (putting the desired merchandise at the top of the results list), which can make the difference between sales made and sales lost. Faceting, database integration, real-time indexing, and query monitoring all enable users to find the products they want, driving conversion rates and enabling a winning online experience.

Online retail sales in the B2C market are expected to reach $340 billion by 2013. (Forrester Research²)

Requirements
• Multidimensional, dynamic search
• Faster results
• Real-time indexing of products
• Faceting and browsing capabilities
• Seamless integration with existing IT infrastructure

Solr Solution
• Faceted search for deeper drill down and browsing
• Intuitive search capabilities for cross-channel shopping experience
• System administration tools for data loading, index replication, monitoring, logging, and cache management
• Query monitoring for better highlighting of popular products

Success Stories
• Buy.com
• Sears.com
• Macys.com
• Zappos.com
• Advanceautoparts.com
• Dollardays.com

² “Consumers will spend more than $340 billion online by 2013, says Forrester,” Internet Retailer, 27 November 2009, http://www.internetretailer.com/dailyNews.asp?id=32630.
Case Study 3: Zappos

Zappos is the premier destination for online shoe shopping. At Zappos, the mission is excellent online customer service: customers should be able to browse shoe styles, sizes, shapes, and colors more easily than at any other shoe store, on or offline. To achieve this, Zappos wanted a robust, flexible, multifunctional search solution. After evaluating many commercial search technologies, Zappos zeroed in on Solr, working with Lucid Imagination to ensure continued, successful deployment.

Requirements
• Simplified, attractive user experience that makes it easy to find and buy
• Relevant results, fast
• Navigation across attributes, such as size, color, and style for broader and deeper results
• Indexing products as they were entered in the catalogs
• Cross-functional navigation to give customers a realistic shopping experience
• Intuitive intelligence to provide alternate suggestions
• Analytical capabilities to drive business strategy
• Facilitating control on results
• Integration with existing IT infrastructure

The Solr Solution
• Subsecond search results, across categories
• Faceting, for easy browsing and discovery and a compelling user experience
• Real-time indexing of products
• Synchronization of visuals, specs, filters, and promotions to make shopping experience true to life
• Information on user activity to help build strategy on product promotions
• Controls to rank popular or high-stock products in results where users are more likely to buy them
• Facilitates integration with heterogeneous open source environment
Job and Career Sites
Job portals are countercyclical to the economy. When the economy flourishes, posted jobs grow in number; when it sags, candidates flock in to post their résumés. Success for an online job portal is tied to the efficiency of its search capability (matching résumés to job listings and vice versa), so that both employers and prospective employees can zero in on just the right opportunity.

For example, an employer may want to navigate through filters such as education, previous employer, salary history, and skill sets to narrow the scope of a candidate search; a job seeker may want to expose these attributes but keep a current employer’s name confidential. A job seeker may also want to apply only to jobs within a particular geographic area.

Lucene/Solr not only provides such flexibility but also addresses other complexities of this industry: linguistic intelligence (handling identical acronyms that correspond to different entities, variations in spelling, and imperfectly constructed search queries), indexing of unstructured data (résumés), and management of ever-growing data.

“I think the breakthrough was when we tried it, and we realized, wow, this thing could really scale.”
Peter Keegan, Monster.com

Requirements
• Linguistic intelligence for more relevant results
• Control of search results to maintain privacy
• Deeper search capability
• Numeric search
• Faster query response
• Reduced infrastructure and customization costs

Success Stories
• Monster
• The Big Jobs
• eBharatJobs
• Careerjet

Solr Solution
• Intelligent, faceted search to enable contextual and linguistic relevance
• Easy configuration for parsing structured and unstructured data
• Easy and seamless installation for lower TCO
• Business process integration and customization with open source code
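The filter-driven narrowing described in this section can be pictured with a toy facet counter. This is a minimal in-memory sketch, not how Lucene/Solr computes facets internally; the résumé records, field names, and helper functions are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical résumé records; the field names are illustrative assumptions.
RESUMES = [
    {"id": 1, "education": "BS", "skills": ["java", "search"], "region": "Boston"},
    {"id": 2, "education": "MS", "skills": ["java", "sql"], "region": "Boston"},
    {"id": 3, "education": "BS", "skills": ["python", "search"], "region": "Austin"},
]


def as_list(value):
    """Treat single-valued and multi-valued fields uniformly."""
    return value if isinstance(value, list) else [value]


def matches(doc, filters):
    """True if the document satisfies every selected filter."""
    return all(wanted in as_list(doc[field]) for field, wanted in filters.items())


def facet_counts(docs, field, filters=None):
    """Count how many matching documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        if matches(doc, filters or {}):
            counts.update(as_list(doc[field]))
    return counts


# An employer who has already clicked the "java" skill filter sees the
# remaining candidates broken down by region.
by_region = facet_counts(RESUMES, "region", {"skills": "java"})
```

Each count tells the user how many results a further click would leave, which is what makes faceted navigation feel like browsing rather than repeated searching.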
Case Study 4: Monster.com
Monster is the largest job search engine in the world, with over a million jobs posted at any one time. By 2008 it had 150 million résumés in its database and served over 63 million job seekers per month; it now runs 300 to 400 queries per second on average, with an average response time of 40 milliseconds. To provide the highest level of service and support to its customers, both employers and job seekers, Monster maintains an unmatched marketplace for employment opportunities, with Lucene-based search at the heart of its business model.

The Requirements
• Managing high volumes of data, continually increasing by double-digit percentages annually
• Maintaining constant inventory updates and providing faster results
• Removing technological barriers that limit the scope of information
• Enabling end users to refine search and drill deeper without any performance impact
• Providing security controls to ensure end user privacy
• Facilitating scalability and flexibility in tandem with the company’s vision and growth plans

The Lucene Solution
• Management of high volumes of data by clustering data to reduce the index size
• Real-time indexing for fresher, faster query results
• Intuitive search to enable in-depth, cross-functional job and résumé browsing
• Faceted search and “single click” filters for search refinement
• Security controls to manage user information
• Unlimited scalability and customization leveraging open source licensing

Libraries, Archives, and Museums (LAMs) Search
The core asset of educational and research institutions is knowledge archived and accumulated over decades. In the world of academic search, the diversity of information returned for any query (text, illustration, audio/video media, or data in any other format) makes unstructured formats a key aspect of the searchable archive.

Lucene/Solr gives academic and research institutions the power to turn information into knowledge by going beyond keyword-driven search to expose a rich variety of results and exploration. Based on the open source model, it not only integrates with the existing IT infrastructure but also leverages the existing classification hierarchies to give structure to terabytes of information spread across disparate collections, significantly reducing overhead and enabling flexible and scalable deployment.

“With Solr, you can do so many things without writing a lick of code. I hadn't realized how easy it is to extend our custom request handler, response writer, and update handler. Just move it all to Solr and let it do the heavy lifting.”
Sjoerd Siebinga, Europeana

Success Stories
• Smithsonian Institution
• Europeana, the European Union online cultural archive
• The US Library of Congress and World Digital Library
• Stanford University Library
• University of Michigan Graduate Library

Requirements
• Management of multiple formats of data and documents
• Customization and scalability
• Linguistic support in queries
• Faster results

Solr Solution
• Optimized index infrastructure that limits size without compromising speed or flexibility
• Easy customization for implementing taxonomy rules
• Faceted search to narrow results to a specific source across diverse sets of data
• Instant results
• Seamless integration with IT infrastructure for lower TCO
Case Study 5: Smithsonian
The Smithsonian Institution is the flagship museum collection of the United States, supporting a research institute that provides “one-stop” searching for 2 million records, including nearly a quarter of a million media files (images, online journals, and other resources) distributed across dozens of archives, databases, museums, and libraries. To make this treasure of information easily accessible to people, the Smithsonian needed an efficient search solution that could overcome the following challenges:

The Challenges
• Managing a complicated taxonomy that could no longer accommodate a growing data index
• Indexing disparate types of content, including documents, videos, and images
• Making information available from a large database
• Providing access controls to restrict information
• Integrating with existing legacy tools

The Smithsonian chose Lucene/Solr and worked with Lucid Imagination to create an optimized, well-designed solution.

The Solr Solution
• Efficient index strategy to manage a mix of structured and unstructured data
• Holistic search, with configuration optimized to reduce the number of servers and handle query requests better
• Filtering of information through faceted search
• Access controls to restrict information based on membership profiles
• Integration with the existing IT infrastructure
• Guidance and assistance on setting up a replicated search environment
Social Media Search
Search solutions must support differentiated business models matching Web 2.0 innovations, including user-generated content and mashups, without compromising scalability. That is a challenge, given the virtually limitless content on the Internet. Success and differentiation are measured by how well the site provides relevant results to grow its user base and keep users engaged.

Increasingly, the technological factors driving Web 2.0 application paradigms are finding their way into the enterprise, unlocking collaboration and productivity in new ways that challenge conventional organizational bounds. These applications rely in equal measure on search to create the connections between employees that enable discovery, cross-pollination, and more efficient collective effort.

Lucene/Solr not only provides fast results but also facilitates flexible, intuitive navigation to help end users connect with others. It boosts the reach and performance of search, while cutting implementation costs and lowering barriers to innovation.

“With Solr, we really treat it as kind of a platform where we can build other kind of things on top of it… We have a very valuable set of data, and we really want to explore new ways of building new features from that data set.”
Sammy Yu, Digg.com

Success Stories
• Digg
• MySpace
• LinkedIn
• Reddit
• Technorati
• Scout Labs
• Xmarks.com

Requirements
• Deliver search results as soon as content is available
• Deeper drill-down capabilities
• Intuitive interface

Lucene/Solr Solution
• Near-instant results with segmentable indexing
• Intuitive search
  - Data-driven spellchecking based on user search histories
  - Linguistic support through “Did you mean” functionality
• Highlighting keywords
• Deeper drill-down with faceting
• Real-time content updating
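The “Did you mean” item above, driven by user search histories, can be sketched in a few lines. This is a simplified illustration that uses Python's standard `difflib` module rather than Solr's own spellcheck component; the function names, the sample history, and the similarity cutoff are assumptions chosen for the example.

```python
from collections import Counter
from difflib import get_close_matches


def build_vocabulary(search_history):
    """Count how often each term appears in past queries."""
    return Counter(
        term for query in search_history for term in query.lower().split()
    )


def did_you_mean(word, vocabulary, cutoff=0.75):
    """Suggest a known term for an unknown word, or None if nothing is close."""
    word = word.lower()
    if word in vocabulary:
        return None  # the word is already known; no suggestion needed
    candidates = get_close_matches(word, list(vocabulary), n=3, cutoff=cutoff)
    if not candidates:
        return None
    # Among close matches, prefer the term users have searched most often.
    return max(candidates, key=lambda term: vocabulary[term])


# Hypothetical search history used only for illustration.
history = ["lucene tutorial", "lucene solr", "solr facets"]
vocabulary = build_vocabulary(history)
suggestion = did_you_mean("lucnee", vocabulary)
```

Because the vocabulary comes from what users actually search for, suggestions track the site's own content and jargon rather than a general-purpose dictionary.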
Case Study 6: Digg.com
Digg displays the wisdom of the crowds. By leveraging the mass collaboration of readers distributed across the Internet (everything on Digg is submitted by the public community for the public community), it builds on the easy findability of information valued by the marketplace of readers and consumers. Digg realized early on that to succeed in the business of information, it needed to make information available to its audience as effortlessly as possible. It saw the following challenges as roadblocks to implementing a base search application:

Requirements
• Managing unstructured data (13 million documents and growing) in real time
• Providing results faster
• Facilitating smart navigation to provide information in digestible portions
• Recognizing and eliminating duplicate content
• Providing semantically and linguistically smart search
• Facilitating scalability while containing costs

Digg selected Solr for its unmatched flexibility and functionality.

The Solr Solution
• Highly customizable and flexible
• Results in subseconds, with simple-to-use pull-downs to refine results
• Fuzzy duplicate detection (by coding)
• Unlimited scalability and seamless integration with the heterogeneous environment
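To give a sense of what fuzzy duplicate detection involves, the sketch below compares two texts by the overlap of their word shingles (Jaccard similarity). This is one common general technique, shown under stated assumptions; it is not a description of Digg's code or of any Solr feature, and the function names and the 0.5 threshold are hypothetical.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5):
    """Flag two texts as near-duplicates when their shingle overlap is high."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


story_a = "Apache Solr release ships with faster faceting and replication"
story_b = "Apache Solr release ships with faster faceting and new replication"
duplicate = is_near_duplicate(story_a, story_b)
```

Comparing shingles rather than whole strings lets small edits, such as one inserted word, leave most of the overlap intact, so resubmitted stories can be caught even when they are not byte-for-byte identical.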
Case Study 7: LinkedIn
Connecting 50 million registered users from 200 countries across 170 industries and matching them to the right professional contacts is what LinkedIn is all about. LinkedIn’s business is premised on an intelligent search application that could overcome the following:

The Challenges
• Managing an ever-growing database, with one new member joining and creating a profile every second
• Indexing unstructured data in real time
• Giving instant query responses, even in peak traffic hours
• Providing intuitive navigation and intelligent linguistic support
• Integrating with other Web 2.0 tools to build user profiles that integrate data from multiple sources

LinkedIn chose Lucene to implement the search function at the core of its business model.

The Lucene Solution
• Used index segmentation for faster results and to limit index base
• Provided faceted search and intelligent support features like changing the view of search results and auto-completion of contacts
• Calculated relative relevance, ranking results on the fly based on the relationship between the user’s profile and the other profiles being searched
• Integrated with the latest web tools; for example, incorporating videos in search results
• Provided a “scale as you grow” facility through the flexibility of the open source model
Enterprise (Intranet) Search
Enterprises today have a global footprint, which leads to the creation of multiple content types and the use of disparate applications and content management systems across business centers. The result is often silos of unmanaged data spread across the intranet of an enterprise, a situation where information is omnipresent but cannot be used. To achieve a competitive advantage, enable intelligent decision making, eliminate duplication of work, and lower the cost of ownership, enterprises need a search application that gives structure to unstructured data and provides a single gateway to search across multiple enterprise repositories, with speed, flexibility, and intuitive intelligence.

Lucene/Solr is a solid match for enterprise search. As a customizable and multifunctional search application, Lucene/Solr provides robust search features at minimal cost. The open source development model behind Lucene/Solr integrates seamlessly with legacy tools and brings down the total cost of ownership significantly. Given the sensitive nature of enterprise content, Lucene/Solr facilitates document-level, role-based security. And with transparent search algorithms and configurability for relevancy, Lucene/Solr enables intranet search with the precise control enterprise content owners require, ensuring that results consistently deliver the right documents to the right people.

“The search and discovery software market grew 19 percent in 2008 to $2.1 billion”
Sue Feldman, IDC

Requirements
• Single interface to access enterprise data
• Faster results
• Control over search results
• Ready integration with existing content management software

Solr Solution
• Single gateway for all types of data
• Dynamic boosting of content
• Transparent search algorithms and relevancy tuning
• Customization and easy integration with open source code
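One common way to approximate the document-level, role-based security mentioned above is to index each document's permitted roles in a field and attach a filter query to every search. The sketch below only builds the request parameters. `q`, `fq`, and `wt` are standard Solr request parameters, but the `acl` field name, the helper functions, and the overall scheme are assumptions for illustration, not a prescribed Lucene/Solr security model.

```python
from urllib.parse import urlencode


def acl_filter(roles, field="acl"):
    """Build a filter query matching documents visible to any of the roles.

    `field` is a hypothetical multi-valued field holding permitted roles.
    """
    if not roles:
        raise ValueError("at least one role is required")
    quoted = " OR ".join(
        '"' + role.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for role in sorted(roles)
    )
    return f"{field}:({quoted})"


def secured_params(user_query, roles):
    """Combine the user's query with the access-control filter."""
    return {"q": user_query, "fq": acl_filter(roles), "wt": "json"}


def secured_query_string(user_query, roles):
    """URL-encode the parameters for an HTTP request to the search server."""
    return urlencode(secured_params(user_query, roles))


params = secured_params("quarterly budget", {"hr", "finance"})
```

Keeping the access rule in the filter query rather than in the user's query text means relevance ranking is unaffected, and the filter is applied by the application layer where users cannot remove it.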
Case Study 8: Food and Drug Administration
The Food and Drug Administration (FDA) is a U.S. government agency responsible for regulating and supervising the safety of foods, medications, veterinary products, tobacco, and cosmetics. The FDA has a large repository of information that dates back multiple decades and exists in formats ranging from early optical character recognition to recent electronic formats. To mine this knowledge base, the FDA is developing a semantic mining framework using open source tools such as Apache Lucene and Solr.

Requirements
• Integrating petabytes of data highly distributed across the intranet of an enterprise
• Managing multiple indices for documents stored in distributed repositories
• Managing and maintaining archival data and evolving vocabularies
• Indexing unstructured data in real time
• Recognizing and eliminating duplicate content
• Handling concurrent queries and delivering fast and relevant results
• Restricting search results according to agency access control policies
• Integrating with existing infrastructure without additional overhead

The Lucene Solution
• A single gateway to search across multiple enterprise repositories
• Duplicate detection
• Fast and relevant results with content analysis and query interpretation algorithms
• Filtering of results based on the enterprise’s access controls and security policies
• Integration with existing enterprise infrastructure to reduce TCO
Business Use Case Matrix
To simplify mapping your search needs to existing search applications in the real world, the matrix below compares business use cases against key search requirements. While not an exhaustive list, the matrix highlights the different business use cases across sectors and business models, reflecting the adaptability of Lucene/Solr across the various domains of search applications and use cases.

[Matrix. Rows (verticals): Enterprise (Intranet); Education (Schools/Universities, Libraries); Job Portals; Social Networks; News; Media; E-Commerce Sites; Financial Services; Yellow Pages; Horizontal Portals. Columns: Users (Internal, Customer Facing); Content (Original, Aggregated); Content Update Frequency (High, Medium, Low); Access Control. The placement of the individual check marks is not recoverable from this copy.]
Appendix: Lucene/Solr Features and Benefits
Lucene and Solr are complementary technologies that offer very similar underlying capabilities. In choosing the search solution best suited to your requirements, the key factors to consider are application scope, development environment, and software development preferences.

Lucene is a Java-based search library that offers speed, relevancy ranking, complete query capabilities, portability, scalability, low-overhead indexes, and rapid incremental indexing.

Solr is the Lucene Search Server. It presents a web service layer built atop Lucene, using the Lucene search library and extending it to provide application users with a ready-to-use search platform. Solr brings with it operational and administrative capabilities like web services, faceting, configurable schema, caching, replication, and administrative tools for configuration, data loading, statistics, logging, cache management, and more.

Lucene presents a collection of directly callable Java libraries and requires coding and solid information retrieval experience. Solr extends the capabilities of Lucene to provide an enterprise-ready search platform, eliminating the need for extensive programming.

Solr provides the starting point for most developers who are building a Lucene-based search application. It comes ready to run in a servlet container such as Tomcat or Jetty, making it ready to scale in a production Java environment. With convenient REST-like web-service interfaces callable over HTTP, and transparent XML-based configuration files, Solr can greatly accelerate application development and maintenance. In fact, Lucene programmers have often reported that they find Solr contains “the same features I was going to build myself as a framework for Lucene, but already very well implemented.” Using Solr, enterprises can customize the search application according to their requirements without incurring the cost and risk of writing the code from scratch.

Lucene provides greater control of your source code and works best in development environments where resources need to be controlled exclusively by Java API calls. It works best when constructing and embedding a state-of-the-art search engine, allowing programmers to assemble and compile inside a native Java application. While working with Lucene, programmers can directly control the large set of sophisticated features with low-level access, data, or state manipulation. Enterprises that do not require strict control of low-level Java libraries generally prefer Solr, as it provides ease of use and scalable search power out of the box.
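Because Solr is called over HTTP, a client in any language can query it by composing a URL. The following is a minimal sketch that builds such a URL. `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters, while the host, port, path, and field names are assumptions about a local test installation and will differ in a real deployment.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust for your deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, facet_fields=(), rows=10, base_url=SOLR_SELECT_URL):
    """Compose a Solr search URL, optionally requesting facet counts."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to be counted.
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "?" + urlencode(params)


# "title" and "category" are hypothetical field names from an example schema.
url = build_select_url("title:lucene", facet_fields=["category"], rows=5)
```

With a Solr server running at that address, the resulting URL could be fetched with any HTTP client (for example `urllib.request.urlopen`) and the JSON response parsed by the application; no Java code is involved on the client side.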
As functional siblings, Lucene and Solr have become popular alternatives for search applications; the two differ mainly in the style of application development used. Key benefits of search with Lucene/Solr include:

• Search Quality: Speed, Relevance, and Precision
Lucene/Solr provides near-real-time search and strong relevance ranking to deliver contextually relevant and accurate results very quickly. Tailor-made coding for relevancy ranking and sophisticated search capabilities like faceted search help users sort, organize, classify, and structure retrieved information so that search delivers the desired results. Search with Lucene/Solr also provides proximity operators, wildcards, fielded searching, term/field/document weights, find-similar functions, spell checking, multilingual search, and much more.

• Lower Cost and Greater Flexibility, Plug and Play Architecture
Lucene/Solr reduces recurring and nonrecurring costs, lowering your TCO. As open source software, it does not require purchase of a license and is freely available for use. The open source code can be used as is, modified, customized, and updated as appropriate to your needs. Solr is easily embedded in your enterprise’s existing infrastructure, reducing costs of installation, configuration, and management.

• Open Source Platform for Portability and Easy Deployment
Because Lucene/Solr is an open source software solution, it is based on open standards and community-driven development processes. It is highly portable and can run on any platform that supports Java. For instance, you can build an index on Linux, copy it to a Microsoft Windows machine, and search there. This portability enables you to keep your search application and your company’s evolving infrastructure in tandem. Lucene, in turn, has been implemented in other environments, including C#, C, Python, and PHP. At deployment time, Solr offers very flexible options; it can be easily deployed on a single server as well as on distributed, multiserver systems.

• Largest Installed Base of Applications, Increasing Customer Base
Lucene/Solr is the most widely used open source search system and is installed in around 4,000 organizations worldwide. Publicly visible search sites that use Lucene/Solr include CNET, LinkedIn, Monster, Digg, Zappos, MySpace, Netflix, and Wikipedia. Lucene/Solr is also in use at Apple, HP, IBM, Iron Mountain, and Los Alamos National Laboratories.

• Large Developer Base and Adaptability
As community-developed software, Lucene/Solr provides transparent development and easy access to updates and releases. Developers can work with open source code and customize the software according to business-specific needs and objectives. Its open source paradigm lets Lucene/Solr provide developers with the freedom and flexibility to evolve the software with changing requirements, liberating them from the constraints of commercial vendors.

• Commercial-Grade Support for Mission Critical Search Applications from Lucid Imagination
Lucid Imagination provides the expertise, resources, and services that are needed to help enterprises deploy and develop Lucene-based search solutions efficiently and cost-effectively. Lucid helps enterprises achieve optimal search performance and accuracy with its broad range of expertise, which includes indexing and metadata management, content analysis, business rule application, and natural language processing. Lucid Imagination also offers certified distributions of Lucene and Solr, commercial-grade SLA-based support, training, high-level consulting, and value-added software extensions to enable customers to create powerful and successful search applications.
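Several of the query features named under Search Quality above (fielded searching, proximity operators, wildcards, and term weights) are expressed directly in the text of a query using the Lucene query parser syntax. The helpers below are a hypothetical convenience layer that simply assembles such query strings; the field names `title` and `body` are assumptions about an example schema.

```python
def fielded(field, clause):
    """Restrict a clause to one field, e.g. title:solr"""
    return f"{field}:{clause}"


def phrase_within(phrase, distance):
    """Proximity search: the phrase's words within `distance` positions."""
    return f'"{phrase}"~{distance}'


def prefix(term):
    """Wildcard search for any term starting with `term`."""
    return f"{term}*"


def boosted(clause, weight):
    """Give a clause more weight in relevance ranking."""
    return f"{clause}^{weight}"


# Prefer matches in the (hypothetical) title field over the body field.
query = " OR ".join([
    boosted(fielded("title", phrase_within("open source", 3)), 2),
    fielded("body", prefix("luc")),
])
```

The resulting string is what would be passed as the query text to Lucene's query parser or as Solr's `q` parameter.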