The technological and strategic wall that information retrieval and search engine companies such as Verity Corp and Fulcrum Technologies Inc have hit in the past year carries more than one message. First, it says that customers want better, more stable, more manageable ‘solutions’ – a demand that is relatively easy to meet through integration, service and marketing. But it also says that customers don’t believe the technology is sophisticated enough. The problems have moved on, and so must the technology. That challenge is attracting dozens of new companies into the market. It is also spurring a wave of new investment, in both R&D and acquisitions, among the search software companies and other leading software companies such as Microsoft and Oracle, all of which fear being left behind. Fulcrum, for example, paid $1m in July for Central Resource Laboratories, a small unit of the British-based company Thorn EMI which specialises in information clustering technology. It has also licensed technology from ERLI, a French company specialising in semantic analysis. The promise can be seen clearly enough in just one building in a small high-technology business park on the outskirts of Cambridge, England. Within two doors of each other are Autonomy Corp and Muscat, both set up by Cambridge University graduates to exploit pattern recognition technology in search engines.

By Andrew Lawrence

Autonomy Corp, which has major contracts to provide technology to Rupert Murdoch’s News International and to an as yet unspecified set-top box manufacturer, has raised $45m in venture capital from Apax and English National Investment Company – a huge amount by European standards. These investors hope to realise their investment early in 1998, when Autonomy floats on Nasdaq. Muscat, meanwhile, which supplies advanced search technology to Reuters, took an easier route. Early in 1997 it sold 70% of its equity to MAID Plc, a UK-listed supplier of online services, for an unspecified sum. This deal, says managing director John Snyder, will provide the company with the resources to establish a much higher profile for what he claims is a world-beating technology.

But what is wrong with the current generation of search engines? The answer, say evangelists of the new systems, is that they are inaccurate, crude, difficult to use and unintelligent. Analysis based on the frequency with which words occur lies at the core of products from Verity, Fulcrum, Oracle and many others, supplemented by a number of other techniques which add ‘contextual intelligence’. But the two British companies, Autonomy and Muscat, claim to go much further. Dr Mike Lynch, chief executive of Autonomy, says that most context-based search tools are “sellotape technology … they try to fix something that is broke.” Autonomy treats searching for text as a pattern recognition problem, akin to identifying fingerprints or licence plates. At the heart of both companies’ technology is the Dynamic Reasoning Engine, based on Bayesian mathematics, which analyses the way words and associated words are scattered across documents. “We do not score on frequency, but on relevance,” says Snyder of Muscat.

Autonomy also says it uses other advanced techniques, including ‘information theory’, developed by the mathematician Claude Shannon in the 1940s as a way of filtering out ‘noise’ in telecommunications systems. The theory is used to filter out information which does not add to what is already known. To add spice to its planned IPO, the company also uses neural network techniques. This technology is increasingly used in data mining to discover unforeseen relationships between facts hidden in structured databases. Autonomy applies the same techniques to unstructured data to find similarities between documents. The use of neural networks is proving controversial in search engines – perhaps because, like artificial intelligence, it is a technology which is little understood and much hyped. Many analysts believe that it cannot add significantly to the ultimate accuracy of a search. “Neural networks is a jazzy term to say ‘something clever is going on here’,” says Snyder. One company that certainly contests this view is Aptex Software, a subsidiary of the data mining and neural networking specialist HNC Software. It has developed a technique called ‘content mining’, based on neural networks and what it calls ‘content vectors’, which can be trained to recognize similarities between documents and then summarize and categorize them.

Yet another technique being investigated is semantic and structural analysis of text. This is the most ambitious of all, and evolves from research into language and signs (semiotics). The technique is not just about analyzing the simple relationships between words, but involves building up an ‘intelligent’ view of what words mean. Later this month Semio Corp, a US/French company, will launch SemioMap, a tool which can be used to analyze the actual meaning of sentences in documents. This tool will – if it works as claimed – enable people to ask direct ‘what’ and ‘why’ type questions, as if the computer in fact understood English.
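For readers who want to see the distinction in concrete terms, the gap between scoring on raw frequency and weighting words by how much they actually tell you can be sketched in a few lines of Python. This is an illustration only: it is not the method used by Autonomy, Muscat or Aptex, none of which is described in detail here, and it involves neither Bayesian inference nor neural networks. The toy documents, the function names, and the choice of a standard inverse-document-frequency weight with cosine similarity between term vectors are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from any product).
DOCS = {
    "a": "the bank raised interest rates and the bank cut lending",
    "b": "the river bank flooded after the heavy rain",
    "c": "interest rates rose as the lending market slowed",
}

def tokenize(text):
    """Split text into lower-case words."""
    return text.lower().split()

def term_counts(docs):
    """Count how often each word occurs in each document."""
    return {name: Counter(tokenize(text)) for name, text in docs.items()}

def information_weights(counts):
    """Weight each word by log(N / df), where df is the number of
    documents containing it. A word found in every document adds
    nothing to what is already known and gets a weight of zero."""
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def frequency_score(query, counts):
    """Crude score: total occurrences of the query words."""
    terms = tokenize(query)
    return {name: sum(c[t] for t in terms) for name, c in counts.items()}

def weighted_score(query, counts, weights):
    """Same count, but each word is scaled by its information weight."""
    terms = tokenize(query)
    return {
        name: sum(c[t] * weights.get(t, 0.0) for t in terms)
        for name, c in counts.items()
    }

def weighted_vector(count, weights):
    """Represent one document as a vector of weighted word counts."""
    return {t: n * weights.get(t, 0.0) for t, n in count.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors held as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

counts = term_counts(DOCS)
weights = information_weights(counts)
vectors = {name: weighted_vector(c, weights) for name, c in counts.items()}

print(frequency_score("the bank", counts))
print(weighted_score("the bank", counts, weights))
print(cosine(vectors["a"], vectors["c"]), cosine(vectors["a"], vectors["b"]))
```

On this toy data, a query for ‘the bank’ gives document c a non-zero frequency score purely because it contains ‘the’, while the weighted score discards that match. The vector comparison rates the two documents about interest rates and lending as closer to each other than to the one about a river bank, even though the word ‘bank’ is shared with the latter.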
Because of the many technical and logistical problems involved in taking search engines to this high level of sophistication, some researchers believe the best way forward is to use – at least to help the process – a form of tagging. This means that the content of the document, along with other data, can be flagged in a small machine-readable file delivered with it. Although this approach has promise for some applications, it is considered unwieldy by most software suppliers. The search for the ultimate application of search engines does not stop there. As Lynch of Autonomy points out, “You shouldn’t get hung up on the search engine.”
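The tagging idea can be pictured with a small sketch. No particular tag format is named here, so everything in it is hypothetical: the field names, the use of JSON and the helper functions are assumptions chosen only to show what a small machine-readable file delivered with a document might hold, and how a search tool could read it without analysing the text itself.

```python
import json

def build_tag_record(document, title, topics, language):
    """Assemble a small metadata record to travel with a document.
    The field names are invented for this example."""
    return {
        "document": document,
        "title": title,
        "topics": sorted(set(topics)),  # de-duplicate and order the tags
        "language": language,
    }

def to_sidecar_text(record):
    """Serialise the record as the text of a companion file."""
    return json.dumps(record, indent=2, sort_keys=True)

def has_topic(sidecar_text, topic):
    """Check a tag without reading the document itself."""
    record = json.loads(sidecar_text)
    return topic in record["topics"]

record = build_tag_record(
    "report.txt",
    "Quarterly lending review",
    ["lending", "interest rates", "lending"],
    "en",
)
sidecar = to_sidecar_text(record)
print(sidecar)
print(has_topic(sidecar, "lending"))
```

The sketch also hints at why suppliers find the approach unwieldy: someone, or something, has to produce and maintain a record like this for every document.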

Front line of the battle

Once an accurate, contextual analysis can be carried out, it becomes possible to categorize documents automatically – for example, into groups based on a user’s changing criteria. Many of the leading suppliers now claim to have powerful categorization tools. This is currently the front line of the battle between search engine companies. In spite of the investment required, many big user companies have changed their search engine more than once on the basis of the categorization features. Categorization is also linked to visualization – a way of grouping documents into themes on a screen, so that users can jump between areas of interest, drilling down when they see something that interests them. This is an area where yet more young companies are emerging – for example, Thememedia in California, a start-up which will soon announce its first products. All of these new companies are offering exciting, innovative technology. But will they prove to be any more profitable and resilient? Certainly, the example set by other listed search companies at present is not promising – the sector has yet to settle into a pattern of dominant technologies with a stable, reliable business model. The first flush of tools is arriving, but the tools have yet to prove useful.

A longer version of this article first appeared in Computer Business Review.