How to improve product search quality - Seminar II
v0.5
Leon Lee
2009.9

Outline
2. Review and Summary of Last Seminar
3. Principles of Engineering and Research
4. About Semantic Web
5. Current Hot, Advanced Search Engine Sites
6. Possible Directions for Combining AI and SE
7. Identify Our Position
8. Improving Search Quality Plan draft v0.5
9. Suggestions?
10. Other Topics
11. References

2. Review and summary of last seminar
1. Basics of information retrieval: why we need dynamic relevance scoring/ranking.
2. Scoring and boosting in Lucene: how to boost an important field.
3. Thesaurus-based query expansion gave no significant improvement. It may decrease precision, particularly with ambiguous terms.
4. Metadata/search strategy requirement from Bin, abstracted as "select * from index where property1 = word1 and property2 = word2". Prerequisite? Example: ibm t61?

3. Principles of Engineering and Research
1. KISS: keep it simple, stupid.
2. Simple intuition from naive examples, on a solid foundation of theory and practice.
3. Step by step.
4. Neutral academic viewpoint.
5. We are our own worst enemy.
6. Communication.

4. About Semantic Web
1. What is the Semantic Web? Discussions on smth bbs and douban.com.
2. DBpedia and Wikipedia
3. Cyc
4. GoodRelations: an ontology for linking product descriptions and business entities on the Web
5. HowNet, Hierarchical Network of Concepts in China

XML, RDF, RDFa, Rules, Inference ...
Semantic != Semantic Web/Net
Semantic nowadays = understanding content by Machine Learning, Natural Language Processing, Information Retrieval, Data Mining ... techniques + Linked Data?

5. Current Hot, Advanced Search Engine Sites & Techniques
What kind of techniques do the following sites use?
(1). Wolfram|Alpha
What is the core technology of Wolfram|Alpha? A computational knowledge engine.
Its four key general areas are the data curation pipeline, the algorithmic computation system, the linguistic processing system, and the automated presentation system.
Wolfram|Alpha computes answers to specific questions using its built-in knowledge base and algorithms.
e.g., n-grams "it was the best of times it was the worst of times"; unicode 8900 to 8915

(2). yebol.com
Instead of the common “listing” of Web search queries, Yebol automatically clusters and categorizes search terms, Web sites, pages and contents.
Yebol allows for a multi-dimensional search result instead of the normal one-dimensional search seen in most web search engines today. This provides a more accurate summary of top sites and categories; a wider array of related search terms; a longer and richer expansion for query results; and a deeper base of links and keywords in search result pages.
Unlike current search platforms, Yebol provides hundreds of easily identified and accessibly categorized results in one easily navigable page.
Yebol uses a combination of algorithms and human knowledge to build a revolutionary web directory for each search term. The Yebol engine clusters search results into groups of term-specific categories.

(2). Powerset
In the search box, you can express yourself in keywords, phrases, or simple questions. On the search results page, Powerset gives more accurate results, often answering questions directly, and aggregates information from across multiple articles.
Powerset is working on building a natural language search engine that can find targeted answers to user questions (as opposed to keyword-based search). For example, when confronted with a question of the form 'which U.S. state has the highest income tax?', conventional search engines ignore the question and instead do a search on the keywords 'state, income and tax'.
Powerset, on the other hand, attempts to use natural language processing to understand the nature of the question and return pages containing the answer.

(3). Freebase
Freebase is an open database of the world’s information. It’s built by the community and for the community, free for anyone to query, contribute to, build applications on top of, or integrate into their websites.
It contains structured information on many popular topics, including movies, music, people and locations, all reconciled and freely available via an open API.
Under the hood, Freebase is a graph database. This means that instead of using tables and keys to define data structures, Freebase defines its data structure as a set of nodes and a set of links that...

(4). Evri http://www.evri.com/
You'll find profiles on people, places, books, movies, companies, events, organizations, teams, products -- all kinds of stuff. And we're adding new ones all the time. You can find quick facts and summary information, as well as recommended articles, news, blog posts, photos, tweets, and videos. On each page, we'll also highlight the top connections for each topic, and make it easy to browse and discover more relevant information.
Evri’s technology automates connections between Web content by applying a more human-like understanding of the words on the page.
News, Tweets, Images, Quotes, Videos, Description from Wikipedia, Products from Amazon.com, ...

(5). TextRunner search (relation extraction from web pages) http://www.cs.washington.edu/research/textrunner/
(6). Opinion analyzer: search, rate and compare. http://www.swotti.com/
(7). Product wiki. ProductWiki is the resource for free, unbiased product reports written by a dedicated community. http://www.productwiki.com/ http://www.productdb.org/
(8).
Online Dictionary, Encyclopedia http://www.answers.com/
(9). Wikipedia, DBpedia
(10). Google Base: a place where you can easily submit all types of online and offline content, which we'll make searchable on Google (if your content isn't online yet, we'll put it there). You can describe any item you post with attributes, which will help people find it when they do related searches.
(11). Yahoo SearchMonkey: share structured data with Yahoo! Search to display a standard enhanced result (XML, RDFa, feed, GoodRelations, ...)
(12). Google AdWords, AdSense
(13). www.textmap.com
(14). http://www.numenta.com/
HTM technology has the potential to solve many difficult problems in machine learning, inference, and prediction. Some of the application areas we are exploring with our customers include recognizing objects in images, recognizing behaviors in videos, identifying the gender of a speaker, predicting traffic patterns, doing optical character recognition on messy text, evaluating medical images, and predicting click-through patterns on the web.
(15). Product search engines:
http://www.google.com/products
http://cn.bing.com/shopping/
http://shopping.yahoo.com/
Basic functions: filter by category, brand, price range, stores; additional information: reviews.
Advanced product search: phrase queries, boolean queries, search for words that occur in specific fields.
SafeSearch: many users prefer not to have adult sites included in search results (especially if kids use the same computer).

6. Possible Directions for Combining AI and SE
(1). ML, IR, NLP, DM, IE techniques we can use:
Semantic Relatedness
Text Classification
Text Clustering
Named Entity Recognition
HTML Main Content Extraction
Collaborative Filtering
Sentiment Analysis
Opinion Mining
Language Models
Relevance Feedback
Query Expansion
Query Segmentation
Relevance Ranking
User Behavior Analysis
Machine Translation for Cross-Language Retrieval
Trend?

7. Identify Our Position
Current problems with our search engine:
1. ranking/scoring
2. product information extraction
3. taxonomy and product classification
4. lack of filtering and advanced search functions
5. lack of relevant information to enhance user experience
6. performance
7. no reliable distributed data store architecture

A product search engine.
Goals:
extract more product information.
more accurate ranking order.
more intelligent relevant information (connected, structured, categorized, recent, ranked with meaning).

More specific goals:
extract more product information (unstructured web pages to structured information with a unified taxonomy).
more accurate ranking order (state-of-the-art information retrieval model incorporating more factors).
more intelligent relevant information (connected, structured, categorized, recent, ranked with meaning).

8. Improving Search Quality Plan draft v0.5 (I). Baseline
1. Ranking module redesign
   experimental search environment setup
   Luke setup
   boost important fields
   parameter adjustment
   scoring in Lucene and Solr
2. Search code refactoring
   easy to modify and optimize
   delete the old query parser that no one maintains
3. A plan for improving information extraction from product web pages.
4. A scheme for improving taxonomy & classification
5. Add session & click-through records to the query log
6. Redesign the workflow of the tokenizer process in Solr (delimiter)
7. Research on Hadoop, Pig, Hive, HBase; plan for migrating from MySQL to a distributed data store
8. A plan to integrate and simplify the cache scheme

8. Improving Search Quality Plan draft v0.5 (II).
Improving
1. Ranking module improvements
   integrate product information quality into the ranking module
   modify the scoring formula following the TREC paper
   add more parameters to tune
   add an easy UI to observe & adjust the formula
2. Add advanced search functions
3. Search result filtering (distinguishing adult & kids information)
4. Extract full product properties from the web (not only name, price, description)
   apply wrapper techniques
   page type identification
5. Taxonomy integration & product classification
6. Snippets / dynamic summaries

8. Improving Search Quality Plan draft v0.5 (III). Incorporating advanced techniques
1. Ranking module improvements: research state-of-the-art IR models such as language models.
2. Add relevant categorized information.
2.1 Latest product information from RSS news articles
2.2 Related product URLs/information from freebase.com, productwiki.com, wikipedia.com, ... (RDF database or some other way?)
2.3 Review opinion analysis/mining
2.4 Collaborative filtering
3. Query segmentation: extract concepts from the query and refine the query
4. Query log analysis (using click-through records)
5. Simplify the cache scheme to speed up response.
6. Use a distributed data store (Hadoop, Pig ...)
......
Suggestions?

9. Other Topics
NLP in IR: the wrong viewpoint held by most people.
Performance
   multi-index: reduce I/O; make each index fit in memory or a big cache.
   MultiFieldQueryParser
   MultiSearcher
Information Extraction
   wrapper / template
   algorithm for identifying the type of HTML pages on commerce websites
   algorithm for extracting full product information
   web noise detection and elimination
Data Store
   using distributed map-reduce techniques
   Projects based on Hadoop: subprojects and related projects, including Hive, Avro, Pig, HBase, Cascading, etc.
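
To make the query log analysis and map-reduce items above more concrete, here is a minimal single-machine sketch in Python. It only imitates the map and reduce steps in memory; it does not use Hadoop. The log record layout (session id, query, clicked URL), the function names, and the example URLs are all assumptions made for illustration, not the format of our actual query log.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical log record: (session_id, query, clicked_url).
LogRecord = Tuple[str, str, str]
ClickKey = Tuple[str, str]  # (normalized query, clicked url)


def normalize(query: str) -> str:
    """Lower-case and trim a query so trivial variants share one key."""
    return query.strip().lower()


def map_clicks(records: Iterable[LogRecord]) -> Iterator[Tuple[ClickKey, int]]:
    """Map step: emit ((query, url), 1) for every click record."""
    for _session_id, query, url in records:
        yield (normalize(query), url), 1


def reduce_clicks(pairs: Iterable[Tuple[ClickKey, int]]) -> Dict[ClickKey, int]:
    """Reduce step: sum the counts emitted for each (query, url) key."""
    totals: Dict[ClickKey, int] = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def top_urls(totals: Dict[ClickKey, int], query: str, n: int = 3) -> List[Tuple[str, int]]:
    """Return up to n (url, clicks) pairs for a query, most clicked first."""
    wanted = normalize(query)
    hits = [(url, count) for (q, url), count in totals.items() if q == wanted]
    # Sort by descending click count, then by url for a stable order.
    hits.sort(key=lambda item: (-item[1], item[0]))
    return hits[:n]


if __name__ == "__main__":
    # Made-up sample log for illustration only.
    sample_log = [
        ("s1", "IBM T61", "http://a.example/t61"),
        ("s2", "ibm t61 ", "http://a.example/t61"),
        ("s2", "ibm t61", "http://b.example/t61"),
        ("s3", "thinkpad", "http://a.example/t61"),
    ]
    click_totals = reduce_clicks(map_clicks(sample_log))
    print(top_urls(click_totals, "ibm t61"))
```

Counts of this kind could later serve as one extra factor in the ranking formula. On a real cluster the same two functions would correspond to a mapper and a reducer, with the grouping by key handled by the framework.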

References
Information Retrieval Technology (in Chinese), translated by Sun Jianjun, Cheng Ying et al., Science Press
Search Engines: Principles, Technology and Systems (in Chinese), Li Xiaoming, Yan Hongfei, Wang Jimin, Science Press
Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999, China Machine Press
Slides of "Modern Information Retrieval", instructor: Zhang Li, Tsinghua University
Slides of "Advanced Topics in Information Retrieval", Zhang Min, Tsinghua University
Developing Your Own Search Engine: Lucene 2.0 + Heritrix (in Chinese), edited by Qiu Zhe, Fu Taotao
Lucene Analysis and Application (in Chinese), Wu Zhongxin, Shen Jiali
Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti, Morgan-Kaufmann Publishers
Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008
Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999
Lucene and Juru at TREC 2007: 1-Million Queries Track, page 449, E. Amitay, D. Carmel, D. Cohen, IBM Haifa Research Lab
Inverted index - Wikipedia http://en.wikipedia.org/wiki/Inverted_index
Lucene / Solr development experience (repost, in Chinese) http://westguar.spaces.live.com/blog/cns!F6DD25E77539E5DD!358.entry?wa=wsignin1.0&sa=89479379
Analyzers, Tokenizers, and Token Filters, Solr Wiki, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head1c9b83870ca7890cd73b193cefed83c283339089
Semantic web discussion on smth bbs, http://www.newsmth.net/bbstcon.php?board=SearchEngineTech&gid=15880
Semantic web discussion on douban, http://www.douban.com/group/topic/6727027/
GoodRelations http://www.heppnetz.de/projects/goodrelations/
Better Search with Apache Lucene and Solr http://trijug.org/downloads/TriJug-11-07.pdf
Crowdsourcing for relevance evaluation, Alonso, O.; Rose, D. E. & Stewart, B., SIGIR Forum, Vol. 42, pp. 9-15, 2008
Relevance judgments between TREC and Non-TREC assessors, Al-Maskari, A.; Sanderson, M.
& Clough, P., SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 683-684, 2008
Debugging Relevance Issues in Search http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/DebuggingRelevance-Issues-Search
Optimizing Findability in Lucene and Solr http://www.lucidimagination.com/Community/Hear-from-theExperts/Articles/Optimizing-Findability-Lucene-and-Solr