Saturday, March 24, 2007

The end of federated search?

Thinking more about my last post, "Google and the deep web", it appears to me that Google is rejecting federated search, preferring instead to keep a local copy of all the world's data on the Google cluster.

Federated search (or metasearch) is when a search query is sent out to many other search engines, then the results merged and reranked.

In more complicated forms, the federated search engine may build a virtual schema that merges all the underlying databases, map the original query to the different query languages of the individual source databases, only query the databases that have a high likelihood of returning good answers, resolve inconsistencies between the databases, and combine multiple results from multiple sources to produce the final answers.
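To make the simple form concrete, here is a rough sketch in Python of metasearch-style fan-out and merge. The source objects, their search method, and the rank-based scoring are placeholders I made up for illustration, not any particular engine's API:

    # Toy sketch of simple federated search: fan a query out to several
    # sources, pool the results, and rerank the merged list.
    # The Source interface and scoring below are invented for illustration.

    def federated_search(query, sources, top_k=10):
        merged = {}
        for source in sources:
            try:
                # Each source is assumed to expose search(query) returning
                # an ordered list of URLs in its own relevance order.
                ranked_urls = source.search(query)
            except Exception:
                # A slow or unavailable source is simply skipped -- the
                # latency and reliability problem discussed below.
                continue
            for position, url in enumerate(ranked_urls):
                # Source scores are not comparable across engines, so use
                # rank positions as a crude common currency and sum them.
                merged[url] = merged.get(url, 0.0) + 1.0 / (position + 1)
        # Rerank the pooled results; a real system would also do source
        # selection up front and much smarter merging than this.
        return sorted(merged, key=merged.get, reverse=True)[:top_k]

    class StaticSource:
        """Toy stand-in for a remote engine: returns a fixed ranked list."""
        def __init__(self, ranked_urls):
            self.ranked_urls = ranked_urls
        def search(self, query):
            return self.ranked_urls

    print(federated_search("deep web",
                           [StaticSource(["a.com", "b.com"]),
                            StaticSource(["b.com", "c.com"])]))
    # b.com, returned near the top by both sources, ends up ranked first.

Even this toy version has to decide what to do when a source is slow or unavailable, which foreshadows the arguments below.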

The Googlers behind the Dec 2006 IEEE paper "Structured Data Meets the Web: A Few Observations" (PS) make several arguments against this approach succeeding at large scale:
The typical solution promoted by work on web data integration is based on creating a virtual schema for a particular domain and mappings from the fields of the forms in that domain to the attributes of the virtual schema. At query time, a user fills out a form in the domain of interest and the query is reformulated as queries over all (or a subset of) the forms in that domain.

For general web search, however, the approach has several limitations that render it inapplicable in our context. The first limitation is that the number of domains on the web is large, and even precisely defining the boundaries of a domain is often tricky ... Hence, it is infeasible to design virtual schemata to provide broad web search on such content.

The second limitation is the amount of information carried in the source descriptions. Although creating the mappings from web form fields to the virtual schema attributes can be done at scale, source descriptions need to be much more detailed in order to be of use here. Especially, with the numbers of queries on a major search engine, it is absolutely critical that we send only relevant queries to the deep web sites; otherwise, the high volume of traffic can potentially crash the sites. For example, for a car site, it is important to know the geographical locations of the cars it is advertising, and the distribution of car makes in its database. Even with this additional knowledge, the engine may impose excessive loads on certain web sites.

The third limitation is our reliance on structured queries. Since queries on the web are typically sets of keywords, the first step in the reformulation will be to identify the relevant domain(s) of a query and then mapping the keywords in the query to the fields of the virtual schema for that domain. This is a hard problem that we refer to as query routing.

Finally, the virtual approach makes the search engine reliant on the performance of the deep web sources, which typically do not satisfy the latency requirements of a web search engine.
Google instead prefers a "surfacing" approach which, put simply, means making a local copy of the deep web on Google's cluster.

Not only does this give Google the performance and scalability necessary to use the data in their web search, but it also allows them to easily compare the data with other data sources and transform the data (e.g. to eliminate inconsistencies and duplicates, determine the reliability of a data source, simplify the schema or remap the data to an alternative schema, reindex the data to support faster queries for their application, etc.).
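As a purely illustrative sketch of what surfacing amounts to (the form-probing function, record fields, and index structure here are all invented, and say nothing about how Google actually implements it):

    # Toy sketch of a "surfacing" pipeline: probe a deep web form offline
    # with generated queries, deduplicate the records that come back, and
    # index them locally so query-time serving never touches the source.
    # All names here (probe_form, Record fields, etc.) are hypothetical.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Record:
        title: str
        url: str

    def surface(probe_form, probe_queries):
        """Crawl time: run probe queries against a form, build a local index."""
        seen = set()
        index = {}  # naive inverted index: term -> set of Records
        for probe in probe_queries:
            for record in probe_form(probe):
                # Deduplicate: the same item usually comes back for many probes.
                key = (record.title.lower(), record.url)
                if key in seen:
                    continue
                seen.add(key)
                for term in record.title.lower().split():
                    index.setdefault(term, set()).add(record)
        return index

    def local_lookup(index, query):
        """Query time: intersect posting lists in the local copy only."""
        postings = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

The point of the contrast is that all the expensive work (probing, deduplication, schema cleanup) happens offline, so query-time latency no longer depends on the remote site.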

Google's move away from federated search is particularly intriguing given that Udi Manber, former CEO of A9, is now at Google and leading Google's search team. A9, started and built by Udi with substantial funding from Amazon.com, was a federated web search engine. It supported queries out to multiple search engines using the OpenSearch API format they invented and promoted. A9 had not yet solved the hard problems with federated search -- they made no effort to route queries to the most relevant data sources or do any sophisticated merging of results -- but A9 was a real attempt to do large scale federated web search.

If Google is abandoning federated search, it may also have implications for APIs and mashups in general. After all, many of the reasons given by the Google authors for preferring copying the data over accessing it in real-time apply to all APIs, not just OpenSearch APIs and search forms. The lack of uptime and performance guarantees, in particular, are serious problems for any large scale effort to build a real application on top of APIs.

Lastly, as law professor Eric Goldman commented, the surfacing approach to the deep web may be the better technical solution, but it has the potential to run into legal issues. Copying entire databases may stretch what is allowed under current copyright law, and while Google is known for pushing the envelope, yet another legal challenge may not be what they need right now.

9 comments:

Anonymous said...

Google's approach sounds on the face of it like draining the provider database of its value and injecting it into Google instead. Is that a viable business approach once the providers get wise to it?

Govind Kabra said...

Hi Greg,
You gave a very good summary of Google's approach, but I suspect making a local copy of all the data using a "deep crawl" is perhaps not the right approach for the Deep Web.

1. The number of queries you need to pose for each form is huge, especially if you have text fields, and even more so in scenarios where the cardinality of input values is virtually infinite.

2. How do you model all the queries to all these diverse forms? Going back to the authors' comment that there is a wide range of domains, how do you derive the set of queries to be used to probe each of the sources?

3. The load on the server for each probing query is much higher than for simply crawling a surface web page. It would generally require a database query on the server side.

4. The data is quite dynamic, such as price information in many e-commerce domains (flights, hotels, etc.). Given the load required for one cycle of "deep crawl", you can't have too many refresh cycles.

5. Keeping all the query enumeration, schema matching, and mapping challenges aside, web sources impose unique challenges of their own. There are very few GET-based forms these days; such an offline "deep crawl" would not be *easy* to deploy for forms involving POST and JavaScript.

Of course, there are many more challenges in handling deep web content. Above, I have pointed out the ones that are very specific to the "deep crawl" approach.

I work at a startup named Cazoodle, and we are taking a very different approach to solving the challenges of the Deep Web. We will be presenting our system prototype at ICDE. Hope to catch up with you if you are flying in for it.

Greg Linden said...

Good points, Govind. The issue of the data in the copy being stale is a particularly big one, I agree. It is also true that, in some cases, Google may only be able to do a deep crawl of a fraction of the data that potentially might be available from a deep web source.

I am only guessing on this, but I suspect Google may be thinking of imports into Google Base as a potential solution to this problem. The early version of that import process is described here:

http://www.google.com/base/custom.html

While that requires more cooperation from a deep web source, it is able to deal with dynamic data and data that otherwise might be difficult to access.

Anonymous said...

I also think that doing this federated search may not be such a superb idea. The idea has more glamor (sorta like unified string theory :) ) than reality.

Many of these systems already expose their internal data as static web pages for the browse functionality. What I mean is that though the data is behind a form, it is also available in a browse structure. A classic example is Amazon: it has great search functionality behind a form, but it also lets the user browse and hence exposes all the products in its catalogue. Price updating is an issue, but Google is trying to solve it (and other things) using base.google.
The reverse integration of data into base.google is much better than federated search, where one would have to take care of merging results and schemas. Merging results is a pain in itself; schemas seem out of the question :)...

Also, people are getting better at defining search queries (query length is increasing), which would weaken the need for federated search across the individual databases.

Unknown said...

Are the authors really talking about indexing all deep web content or a representative sample that can then be used to derive the data structure?

The rate at which this data changes, as Govind points out, seems to undercut the advantage of indexing these sources.

I can see value in having a statistically representative sample of the data for determining data structure and looking at quality, similarity, relevance, etc. for the source.

A returned result inherits the source's ranking rather than each individual bit of data having its own ranking.

Anonymous said...

A comment and a question.

First, I had a good chuckle at this line from the paper: "The first limitation is that the number of domains on the web is large, and even precisely defining the boundaries of a domain is often tricky ..."

To me, that is the perfect argument...not against federated search, but against personalization! Think about it; if you want to model a person, or model the similarity group to which a person belongs, you have this very same problem that the number of personalizable interest domains amongst large groups of people is large, and even precisely defining the boundaries of a personalization domain is often tricky.

Perhaps if the Google Personalization team has solved this problem, the Google Federated Search team should start talking to them..?

Second, my question: Greg, you conflate federated search and metasearch. To me, federated search is simply the idea of sending your query out to disparate databases (e.g. multiple verticals) and pooling the results.

Metasearch, on the other hand, is more about sending out your query to conceptually similar search engines, and then applying some sort of data fusion or voting scheme to the results. E.g. Savvysearch, or work that Javed Aslam has done.

In federated search, the idea is that each collection that is being searched contains different data. The goal is to bring all these different data sources together into a single, unified interface.

In metasearch, on the other hand, the idea is that each collection that is being searched contains the same data; however, each engine ranks this same data in a different manner. With the assumption that two engines are independent, you can use their ranks to create a combined rank that offers even more relevance than either ranked list independently.

In this manner, metasearch is akin to the idea of "boosting", or weak learner combination, in machine learning. You treat each search engine as an independent, weak classifier, and through metasearch you can come up with a strong engine that outperforms each individual engine.

For example, suppose I do a search on Google, Yahoo, and MSN. I get some links on each engine that are different, and some that are the same. If I treat each engine as independent, and treat the ranks at which they place documents as "votes" toward the relevance of that document, I can increase my signal to noise ratio by ranking higher documents that are voted highly by all three engines.
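A toy sketch of that kind of rank voting, assuming each engine just returns an ordered list of URLs. The result lists are made up, and a document an engine does not return is charged a rank just past the end of that engine's list:

    # Toy rank fusion across engines that index the same collection:
    # average each document's rank positions across the engines, treating
    # "not returned" as a missing vote with a penalty rank.
    # The three result lists below are invented for illustration.

    def average_rank_fusion(rankings):
        """rankings: one ordered list of URLs per engine; lower fused rank wins."""
        all_urls = {url for ranked in rankings for url in ranked}
        fused = {}
        for url in all_urls:
            ranks = []
            for ranked in rankings:
                if url in ranked:
                    ranks.append(ranked.index(url) + 1)  # 1-based rank
                else:
                    # Not being returned at all counts against the document:
                    # charge a rank just past the end of that engine's list.
                    ranks.append(len(ranked) + 1)
            fused[url] = sum(ranks) / len(ranks)
        return sorted(all_urls, key=fused.get)  # lowest average rank first

    google = ["a.com", "b.com", "c.com"]
    yahoo  = ["b.com", "a.com", "d.com"]
    msn    = ["b.com", "c.com", "a.com"]

    print(average_rank_fusion([google, yahoo, msn]))
    # b.com and a.com, which all three engines return near the top, come first.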

So to me, that is metasearch. And again, that is different from my understanding of federated search, in which you do not presume that the separate engines are all ranking over the same underlying collection. It is the difference in underlying collection that makes federated search a conceptually different beast from metasearch.

So that is my question. Does my understanding of the difference between federated search and metasearch jive with your understanding? And does it jive with the Google understanding? Because as far as I can tell from the paper, Google isn't giving up on "metasearch". They're only giving up on "federated search".

Greg Linden said...

Hi, Jeremy. On federated vs. metasearch, I tend to think of metasearch as a limited form of federated search. It seems to me that metasearch is a simpler form of federated search where the underlying databases tend to have similar query languages and data formats.

That would agree with your definition and the definitions in Wikipedia, but perhaps one point of disagreement is the perceived difficulty of doing relevance rank on the merged results of a metasearch. I think the relevance rank problem there is hard (because the source relevance ranks differ in ways that may be difficult to understand and model), but in your comment you seem to suggest that it should be easy. Is that right?

On personalized search, user modeling, and the large number of domains, I think there is a bit of an apples and oranges comparison going on there with the models necessary for federated search. I suspect the user models for that type of personalized search can be high level, basically summaries of generic interests, while the models of the schemas of the federated data sources may need to be quite detailed and specialized. Do you disagree?

Anonymous said...

Greg - If I understand you correctly, you are saying that metasearch is essentially results fusion, while federated search is query reformulation plus results fusion.

If that is the case, then I will loosely or abstractly agree that metasearch is a limited form of federated search, because you have to be able to do results fusion in order to do both.

But the difference I was trying to point out is that the actual data being merged or fused is often conceptually quite different.

My understanding of federated search is that you would issue a query like "pirates of the caribbean", and then the engine would query the movie database server, the Disneyland server, the Library of Congress history server, etc., and then merge or fuse the results. There is probably not too much about the movie in the Library of Congress history server, and not too much "real pirate" history in the internet movie server.

So (my understanding of) fusion in a federated search context boils down to the same problem of deciding which "aspect" or "facet" of a search to show to a personalized searcher.

On the other hand, with metasearch, each search engine is working across the same corpus, and the whole point is that duplicate content is a good thing. The more often that independent search engines retrieve the same document, the higher our confidence is that the document is truly relevant.

I don't know which is easier or harder, fusion in a federated or in a metasearch context. I just know that I have personally worked with metadata fusion, and with a very simple, nay extremely simple, approach (taking the average rank from two separate engines), I have been able to achieve a statistically significant precision improvement.

So maybe metasearch fusion is easier, because such a simple algorithm does yield better results. You couldn't do the same thing for federated search, because if you are fusing the results of a movie server with a historical document server, there is nothing to average, as each collection contains different sets of documents.

But I don't actually know which is easier. My whole point was just to draw out the conceptual distinction between federated search and metasearch... and to wonder whether Google is giving up on both, or just giving up on federated search.

Aruna G said...

Hello Greg ,

How do you say that Google is rejecting federated search? Google Enterprise is definitely looking at federated search - their open source connector architecture is one way to do it, where content that is uncrawlable due to the product architecture is crawled and added to Google Enterprise's global index. Federated search cannot be just for the deep web; it can also be for the 'deep intranet'.
-Aruna G.