EECS Industrial and Public Relations Office and CITRIS present
IBM-Berkeley Day 2004, Thursday, May 20, 2004, Wozniak Lounge, Soda Hall, UC Berkeley
Mining Your Business

Search and Mining as a Disruptive Technology

Search, combined with the increasing availability of online text, is already a disruptive technology -- search allows direct access to information that had previously been organized by libraries, book stores, consultants and other experts, and news media. For many users, keyword search and page-ranking algorithms have displaced the Dewey Decimal system, book categories, consultant reports, newspaper sections, and catalog categories as a way to zero in quickly on relevant information, and electronic access has replaced visits to physical institutions. Search is being embedded into knowledge management and enterprise applications and is beginning to replace traditional enterprise systems as a way for employees to locate information. These innovations make instantly available the information we once spent hours tracking down.

While search is a tremendous automation tool that makes us more efficient, text mining goes beyond efficiency-oriented applications to enable radical improvements in key business processes, such as scientific discovery, risk identification, customer relationship management, and market research. These new applications are disruptive in a less direct way than search, but with enterprise-level impact. For example, text mining makes it possible to find hidden linkages between people that can be important for risk assessment, and between companies for competitor assessment. By "reading" the flow of electronic discussion on newsgroups, blogs, and other internet content, companies can continuously monitor their reputation, what consumers are thinking about when making a purchase decision, and how these are changing. The ability to "sense" market changes almost instantly is putting pressure on companies to respond equally quickly. As time cycles are shortened from months, to weeks, to overnight, many of the tried-and-true business processes for supplying refined research to decision makers are being challenged by less-digested, but more current information.

AGENDA (Time / Session / Location)
Chair: Jean Paul Jacob, IBM Research
9:30 - 10:00am Arrival, Registration, and Coffee 306 Soda
10:00 - 10:15am Remarks and College Introduction by Dean Richard Newton 306 Soda
10:15 - 10:30am Introduction to the Haas School of Business and its interdisciplinary programs such as the Management of Technology Program by Dean Tom Campbell 306 Soda
10:30 - 10:45am Remarks by Dr. Robert Morris, IBM Research V.P. and Director of the IBM Almaden Research Center 306 Soda
10:45 - 11:30am C.S. Prof. Eric Brewer
Search Engines as Databases: An Inktomi Retrospective
Search engines are the highest-volume databases by far, with over 100M queries per day. Remarkably, traditional database technology has played essentially no role in this arena. In this talk, I retrospectively describe the Inktomi search engine as a database, covering its architecture and exploring why traditional DBMS solutions fall short. I then argue that the future of search depends on merging the two areas to enable both more powerful queries and combinations of structured and unstructured data.
306 Soda
11:30 - 12:15pm Robert Carlson, IBM Research V.P. and CEO of WebFountain, and Daniel Gruhl, IBM 306 Soda
12:30 - 1:30pm Lunch Wozniak Lounge
12:30 - 1:30pm

Demonstrations

Wozniak Lounge

W. Scott Spangler
Demo 1: Business Intelligence Workbench using WebFountain technology
Most of the result set returned by a commercial search engine is effectively inaccessible: users see only the top-ranked pages. Business Intelligence Workbench (BIW) uses WebFountain to retrieve and analyze the entire result set in order to produce a complete picture of the concept space. The BIW approach is a mixed-initiative text mining process that uses text clustering along with user requirements and expertise to cull and then categorize the search results retrieved from WebFountain. Multiple taxonomies can be produced both at the page level and at the "snippet" level. Analyzing these taxonomies against each other and against structured information (e.g. date of page) produces unique insights that cover the entire web as opposed to a few selected pages.


Allen Cypher
Demo 2: WebFountain Discovery Solution

A search can easily return millions of pages. The WebFountain Discovery Solution provides new methods for the user to understand the contents of this result set. This demonstration will show the power of exploring billions of web pages to answer specific business questions.


Keiko Kurita
Demo 3: A Powerful New Way to Monitor and Discover Reputational Issues
With the explosion of both traditional and new information outlets—including local and global newspapers, magazines, bulletin boards and newsgroups, blogs, wire services and trade publications—critical factors affecting the reputation of organizations and brands aren't easily discernible from the noise. It is more challenging than ever to take an active role in managing an entity's reputation. IBM and Factiva have partnered to bring IBM's cutting-edge text analytics platform to corporate executives and practitioners concerned with reputation management. Learn how IBM's WebFountain technology works, and see a demonstration of Factiva Insight for Reputation, a powerful new tool for simultaneously discovering the emerging business issues and developing social trends affecting an organization's greatest asset: its reputation.

Afternoon Sessions

1A 1:30 - 2:30pm

Do Opinions on the Web Predict Real-World Actions?
Or, Does Web Chatter Really Matter?
Wozniak Lounge
Blogs, online discussion groups, online opinion forums, active websites, and electronic newsletters and bulletins from schools, clubs, and organizations have given us unprecedented access to unfiltered snippets of public opinion: millions of comments every day, spanning nearly every conceivable topic of general interest, and even more topics of interest to small groups of specialists, hobbyists, and fans. This is the passion of the planet, unfolding in front of us.

Astonishingly, new technologies give us the ability to gather tens of millions of pages of information daily and analyze them to discover patterns, trends, and relationships within this public discourse. This is new territory, and we don't yet know how best to make sense of this flood of information. How can we best track how an idea or piece of information spreads on the Internet? How can we predict whether an idea is about to accelerate into public consciousness or drop out of discussion? How can we separate spam, misinformation, and commercial chatter from personal opinion? What are the linkage points between online opinions and activity ITRW (in the real world)?

Technical Questions

This discussion will focus on potentially productive ways to link online opinions with real-world action.

Chair: Tom Kalil, CITRIS and Special Assistant to the Chancellor for Science and Technology
Panelists (each presentation 15 minutes)

Ross Nelson (IBM)
Communities, relationships, and content: Reading the web
Abstract: WebFountain is using and exploring a number of technologies to extract information from the web. Many of these techniques are being put into use today, but are still relatively unsophisticated. Nonetheless, there is valuable information to be gleaned. These techniques include disambiguation (of entities and geography), natural-language processing (sentiment), and link analysis (relationships).

Prof. Henry Brady, Director of UC Data/Survey Research Center, Dr. Fredric Gey, Research Associate and Director of Technical Services, UC DATA
Survey researchers have been trying to use the web to get a representative sample of American opinion, but there are many challenges confronting surveys on the web. Another model of the web is that it is, like newspapers or television, simply another source of information that needs to be studied by social scientists. We discuss both models and what social science has to say about them. We pay special attention to the problems of statistical sampling, statistical information gathering, and dealing with the multilingual web.

C.S. Prof. Stuart Russell
Truth and Appearance
Abstract: Many classical applications of probabilistic inference involve extracting the hidden truths of the world from the messy appearance they present. With Web, text, and image data, the hidden truths concern objects and events that exist, have certain properties, are related in certain ways, and participate in various ways in the processes that generate appearance. I will describe briefly how to formulate and apply probability models that let us extract truths of this kind from raw data.
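The models in the talk are far richer (relational, first-order), but the core idea of extracting a hidden truth from noisy appearance can be illustrated with a one-variable Bayes computation. All numbers below (prior, per-source noise rates, mention pattern) are invented for the example:

```python
def posterior_true(prior, mentions, p_mention_if_true, p_mention_if_false):
    """Posterior probability that a hidden fact is true, given independent
    binary observations (1 = a source asserts the fact, 0 = it does not)."""
    num = prior          # running P(fact true, observations)
    den = 1.0 - prior    # running P(fact false, observations)
    for m in mentions:
        num *= p_mention_if_true if m else (1 - p_mention_if_true)
        den *= p_mention_if_false if m else (1 - p_mention_if_false)
    return num / (num + den)

# Three of four sources assert the fact; sources rarely assert false facts,
# so a few independent mentions push an unlikely prior close to certainty.
p = posterior_true(prior=0.1, mentions=[1, 1, 1, 0],
                   p_mention_if_true=0.8, p_mention_if_false=0.05)
print(round(p, 3))
```

With no observations the posterior equals the prior; each corroborating mention multiplies the odds by the likelihood ratio 0.8/0.05 = 16, which is why web-scale corroboration is powerful even with noisy individual sources.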

1B 2:30 - 3:30pm Internet Data and Business Processes
(or, May the Web Be With You)
Wozniak Lounge

The Web contains a wealth of information and opinion about companies, products, and services in web pages, blogs, opinion sites, news groups, and self-publication efforts. This information comes from the companies themselves, from consumer advocates and watchdog groups, and from consumers. How valuable is this information? What are companies learning? Can it be used for business decisions? In addition, eCommerce sites gather user profile and preference information. How can this information be effectively used, yet still protect individual privacy? And finally, how can we better understand and monitor the basic internet infrastructure -- is it reliable, is it healthy, is it at risk?

Technical Questions

Chair: Wayne Niblack

Panelists (each presentation 15 minutes)

Wayne Niblack
MarketInsights: Aggregating Company/Brand Information using the Web Fountain Platform
Abstract: Most commercial search engines try to find pages relevant to a user's keyword or keyphrase queries. MarketInsights instead tracks issues and trends related to user-defined subjects, such as companies, brands, or people. It retrieves documents containing the subjects from multiple sources (web, newsgroups, blogs, syndicated news sources, ...), disambiguates the subjects per mention, finds mentions of issues around the subjects, finds trends, and discovers new issues. Results are presented in a UI that allows a user to compare multiple subjects and issues, slice and dice by various criteria, and drill down to source documents.

Prof. John Canny
Personalization, Mining and Privacy
Individuals' desire for privacy apparently conflicts with businesses' desire to match the right products with potential consumers. However, it is possible to provide very accurate personalization with complete privacy protection. Similar techniques can be used to compute the most useful kinds of aggregate information on user data with provable protection of individual data. Such methods potentially allow businesses to make use of user purchase and preference information beyond what is immediately available from their own sites. Recent progress in inference and privacy protection methods will be described which allows them to scale to large e-commerce applications.
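The talk's published techniques use homomorphic encryption and distributed factor analysis; as a simpler stand-in for the same idea (computing an aggregate while provably hiding any individual's value), here is a minimal additive secret-sharing sketch. The server count, ratings, and modulus are all invented for illustration:

```python
import random

PRIME = 2**31 - 1  # arithmetic is done modulo a public prime

def share(value, n_servers, rng):
    """Split a value into n additive shares; any n-1 shares are
    uniformly random and reveal nothing about the value."""
    shares = [rng.randrange(PRIME) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def aggregate(all_shares):
    """Each server sums the one share it received from every user; the
    per-server totals recombine to the sum of the original values,
    with no server ever seeing an individual input."""
    n_servers = len(all_shares[0])
    server_totals = [sum(user[s] for user in all_shares) % PRIME
                     for s in range(n_servers)]
    return sum(server_totals) % PRIME

rng = random.Random(42)
ratings = [4, 5, 1, 3, 5]  # private per-user ratings
all_shares = [share(r, 3, rng) for r in ratings]
print(aggregate(all_shares))  # → 18, equal to sum(ratings)
```

The aggregate (here, a total rating usable for recommendations) is exact, yet each of the three servers holds only uniformly random numbers; privacy relies on the servers not colluding, which is the standard assumption for this family of schemes.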

Prof. Joe Hellerstein
PHI: Public Health for the Internet
Internet Security is often cast using medical metaphors: viruses, vaccines, and the like. Security researchers have called for a "Center for Disease Control" for the Internet, but this has led to concerns about the stewardship of such a center, and its degree of control over the Internet. In the PHI project, we are taking a grassroots, community-oriented approach to monitoring security and performance data on the Internet. Using a peer-to-peer query engine we have built at Berkeley called PIER, we are designing an end-user p2p application that provides individuals with a window onto the epidemiology of the Internet, and gives them global perspectives on their personal Internet "health" risks. By running the application, end-hosts also serve as "sensors" in a global monitoring infrastructure. The design of the system raises a number of challenges in distributed query processing and data analysis, as well as in peer-to-peer privacy and security; it also provides a number of unique opportunities for research in Internet security and performance. PHI is an example of the latent potential of distributed, structured data that is largely unexploited today, but may be of keen interest in the post-web era of Internet-scale data analysis.

1C 3:30 - 4:30pm Search and Mining in Support of Disruptive Business Models Wozniak Lounge
Disruptive business models involve a decisive shift in some aspect of how a company operates or how a market is served. Search involves a radically different approach to information access, and Internet search has already disrupted long-standing business models in scientific and general news publishing. Text mining similarly has the potential to disrupt established ways of gathering market information and making risk assessments. With text mining, market information has the potential to become more granular and specific, more current, more comprehensive, and more reflective of spontaneous viewpoints. At the same time, this research will be less statistically rigorous in some important ways, as the sample is difficult to control and interpret. How will this shift in the "sensing" of market trends disrupt current business models? What opportunities will these disruptions create?

Technical Questions

Chair: Steve B. Cousins, IBM Research
Panelists (each presentation 15 minutes)

Rakesh Agrawal (IBM Research Fellow)
Sovereign Information Sharing, Search, and Mining
The emerging networked economy might disrupt traditional organizational models and lead to the creation of on-demand virtual organizations. We will present some thoughts on information sharing, searching, and mining in such environments.

Robert Carlson
Search Technology as a Change Agent
Abstract: As the first e-business investment cycle ends, companies are faced with a growing challenge that will only increase. Search technology has fundamentally changed the relationship between individuals and the companies that provide the products they buy. Search has liberated customers from the channels that companies have established to deliver their products, has created the most informed consumers that companies have ever had to engage, and allows us to shop the world. Search also allows an unhappy customer to immediately tell 50 million households her opinion of the product or service that did not meet her expectations. As financial success is increasingly affected by search, we believe companies will need new tools and transformed business processes to continue to compete.

C.S. Prof. Michael Franklin
Architectures for Large-scale Data Collection, Dissemination, and Personalization
Emerging networked information systems provide unprecedented opportunities for collecting, aggregating, and filtering data from a wide range of sources and disseminating it to large populations of users, applications, or devices. In this talk I will describe our ongoing development of two currently separate, but clearly related data management architectures: the HiFi system for aggregating information from streaming data sources and ONYX, a high-performance, high-function XML message brokering architecture for large-scale personalized content delivery.

Summary Session in 306 Soda
4:40 - 5:15pm What Happened Today? Summary by Kevin Mann, IBM Research 306 Soda
5:15pm Wine and Cheese Reception, hosted by CITRIS Wozniak Lounge