Robert Inder
7th May, 1999
Conventional Knowledge Based Systems (KBSs) involve the controlled manipulation of symbolic descriptions of the world.
Extracting such symbolic descriptions from sensory data was (and still is) a major challenge in its own right, addressed by specific fields such as speech recognition and vision, and most work in KBSs tries to sidestep it.
The first KBSs worked with very restricted amounts of information (e.g. chess positions), typically solving problems described in specially formatted data files.
KBSs started to achieve much wider use and acceptance when Expert Systems began to ask questions of users (``Has the patient got a rash?''), thereby using the user's perceptual abilities and common sense to analyse the world into the abstract terms that the KBSs required. Subsequently, KBSs began to access data, either statically (i.e. a KBS linked to a database) or by analysing real-time feeds (e.g. network monitoring). To a greater or lesser extent, these systems can determine what data is relevant, and when, but only within a fixed (or at least limited) range of data for which the format and semantics were known to the system builders (or maintainers).
Now, we have the Internet.
By providing ``universal'' connectivity between computers, the Internet has created a previously inconceivable range of possibilities for any program. Innumerable documents -- ranging from the ephemera of electronic ``news'' discussions through to published documents (including newspapers, literature and technical documents) -- can be retrieved from almost anywhere in the world within moments. Programs can instantly query a multitude of databases, covering everything from holiday prices to registered trade marks, and call on other computational services such as dynamic information feeds (e.g. stock prices), notification of changes and so forth. Finally, the Internet allows programs to easily interact with huge numbers of people, either by email or using Web-based forms.
The number of sources of information and services continues to grow at an incomprehensible pace. Crucially, though, those sources are outside the control of--indeed, since they are continually growing and evolving, outside even the knowledge of--the system's builders.
Fully exploiting the opportunity presented by the Internet--e.g. by building systems which can analyse and filter information for a particular purpose--will in several ways require a new breed of KBSs. They will, at the very least, have to be easily configured to deal with new sources of information, and should, more importantly, be able to explore, discover and operate within a huge and ever-changing world of information.
By making it easier to both locate and distribute software, the rise of the Internet has also greatly increased the number of readily accessible software packages. These include not only end-user applications, ``plugins'' and extensions for system tools and programming languages, but also library components for more-or-less standard tasks, such as parsing standard document formats and mark-ups.
In particular, there are efficient implementations of data-intensive algorithms such as neural nets, statistical analysis or ``data mining'' tools, and information retrieval and other techniques for processing free text based on word occurrence statistics. Such packages are potentially valuable new sources of information. It will therefore be increasingly important that software tools are able to work with them--either by incorporating libraries or by interfacing to stand-alone packages.
To best exploit the opportunities presented by the rise of the Internet, a KBS programming tool should offer...
One way to achieve this is by extending an existing language which has some of the desired properties (e.g. by implementing a rule-based engine within Java, such as the Java Expert System Shell (Jess)--see http://herzberg1.ca.sandia.gov/jess/ ).
The alternative is to combine two (or more) languages which each exhibit some of the desired properties into a single programming environment, in the manner of Poplog (Sloman and Hardy, 1983). This is the approach adopted by CAPE.
CAPE is a programming environment which allows programs to be written in an intimate mixture of CLIPS (the C Language Integrated Production System), a rule-based and object-oriented language, and Perl, a procedural programming language.
CLIPS was chosen because it closely integrates a very fast forward-chaining rule-based system with a flexible object system (supporting both message passing and generic functions). CLIPS was initially a partial re-implementation, in C, of Inference ART (Inder, 1989), which was arguably the most powerful of the Lisp-based ``knowledge representation toolkits'' that emerged during the mid/late 1980s. Its rule language features very powerful and efficient pattern-matching facilities based on the RETE algorithm (Forgy, 1982), including the ability to match against the state of objects, and a truth maintenance mechanism. Information and software related to CLIPS is accessible via http://www.ghgcorp.com/clips/ .
Perl was chosen because of its extremely powerful regular expression matching facilities, and its huge library of freely available software modules. It also supports complex data structures. Information and software related to Perl is accessible via http://www.perl.com/ .
Note that CAPE takes a fundamentally different approach to combining languages than that taken by Poplog. Poplog combined languages by creating a common virtual machine and implementing the various languages within it (Smith, Sloman and Gibson, 1983). This allowed seamless integration with minimal performance overhead. In contrast, CAPE includes the standard language engines for both CLIPS and Perl. This has the advantage of ensuring that the functionality of these languages is provided correctly, so that all existing programs and libraries can be expected to work correctly. It also greatly reduces the implementation and maintenance effort, particularly when new features are added to the underlying languages.
To the two separate language sub-systems, CAPE adds inter-calling and data transfer. It also provides mechanisms for synchronising the initialisation of data structures between the two programming systems. Finally, it uses CLIPS's run-function facility to monitor Internet sockets, allowing the system to remain responsive to socket activity while reasoning.
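To give a flavour of this inter-calling, the following is a minimal sketch only: it borrows the perl-call-s embedded-Perl call that appears in the rule examples later in this paper, while the host_of Perl subroutine and the to-classify and host facts are invented purely for illustration.

  ;; Illustrative sketch: perl-call-s is taken from the rule examples later
  ;; in this paper; the host_of Perl subroutine and the to-classify/host
  ;; facts are hypothetical.
  (defrule classify-url
    "Ask a Perl subroutine for the host part of a URL and record it"
    (to-classify ?url)
    =>
    (assert (host ?url (perl-call-s host_of ?url))))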
The lowest-level handling of socket activity is done in C within the core of CAPE. A function called after each rule firing accepts connections to any ports being monitored, and aggregates any data received from any current connection. Then, whenever the buffer is full, or a record-break character (currently newline) is received, the accumulated byte string is passed to a connection-specific Perl function for filtering. This function could simply respond directly to the input received. Typically, though, it will map it into CLIPS working memory, thereby allowing the full power of the pattern matcher to be used to decide when and how to respond.
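For instance, if a filter function asserted each incoming line as a fact, a rule could then decide how to respond to it. The socket-line fact format and the reply_to Perl subroutine below are hypothetical; a real filter function is free to choose whatever representation suits the application.

  ;; Illustrative only: the socket-line fact and the reply_to Perl
  ;; subroutine are hypothetical stand-ins for whatever a
  ;; connection-specific filter function chooses to assert and call.
  (defrule answer-ping
    "Reply when a monitored connection sends a 'ping' line"
    ?input <- (socket-line ?connection "ping")
    =>
    (retract ?input)
    (perl-call-s reply_to ?connection))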
The listener itself is written in Perl, so it can be redefined, and thus customised or extended, by the user.
The CLIPS rule engine has a very powerful mechanism for deciding which rule to fire at any point. This can be used to make extremely subtle and flexible decisions about the flow of control within the system--i.e. about what the CAPE application should do next. Doing this requires structuring the system as a whole as a rule-based system, so that firing rules triggers activities that are possibly implemented in Perl. Such a system will also remain responsive to external events (e.g. socket activity), provided that the code executed by any particular rule does not take too long to run.
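As an illustration of this style of control, the following sketch uses CLIPS salience declarations to give one activity priority over another; the pending-query facts and the send_query Perl subroutine are hypothetical, and the salience values simply show how conflict resolution can decide what the application does next.

  ;; Illustrative sketch: pending-query facts and the send_query Perl
  ;; subroutine are hypothetical.
  (defrule handle-urgent-query
    "Deal with urgent queries before anything else"
    (declare (salience 100))
    ?q <- (pending-query ?id urgent ?text)
    =>
    (retract ?q)
    (perl-call-s send_query ?text))

  (defrule handle-routine-query
    "Handle routine queries when nothing more urgent is waiting"
    ?q <- (pending-query ?id routine ?text)
    =>
    (retract ?q)
    (perl-call-s send_query ?text))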
In this model of operation, the system uses Perl to help map ``the world'' into symbols that CLIPS can then reason about, and then to turn the symbolic results of that reasoning into actions in the world.
CAPE has been designed (and primarily tested and exercised) with this model of system operation in mind. However, there is no (known!) obstacle to building an ``inverted'' system -- i.e. one in which overall control resides in a Perl program (or an interface generating call-backs into Perl), within which some subroutines specify or initiate rule-based reasoning.
The functionality of the current version of the system has been stable since the spring of 1998, and the system itself has been used for the development of substantial research systems since the following summer.
Three demonstration systems form part of the CAPE distribution: a web server that can be configured to add links to pages, a demonstration of bi-directional inter-process communication, and a ``dungeon'' game.
The entire application was built from scratch in five days. It contains a total of 2800 lines of CAPE code (about one third Perl), although re-implementation using the webserve component application would reduce this substantially.
In addition to the demonstration and component applications mentioned above, CAPE is being used to develop a number of research systems. Most of this work relates to DIME: the Distributed Information Manipulation Environment. DIME is a collection of components useful for building systems to retrieve and manipulate distributed data. These components include a highly flexible document cache (which is able to fetch documents when necessary, store them locally, and search them in various ways) and a framework for specifying and controlling interaction with remote search systems.
DIME is being used to develop two research systems. K-DIME, or Kansei DIME (Inder, Bianchi and Kato, 1999) is concerned with fetching and filtering images based on subjective criteria, or kansei. Users connect to K-DIME using a Web browser, and specify their requirements in terms of both the objective and subjective properties of the images. K-DIME then queries web-accessible search engines, such as the Alta-Vista Photofinder, for relevant images based on the objective criteria. It then analyses the results of these queries, retrieves the relevant images and then chooses and configures external ``oracles'' (currently neural nets) to assess them against the subjective criteria specified by the user. K-DIME is responsible for all aspects of the overall control of the operation, such as deciding which oracles to run and when, setting filter thresholds and so forth.
The second system being built using DIME is Maxwell, a ``smart'' agent intended both to query and to combine the results from a number of book vendors' databases on behalf of the user. An initial version of Maxwell was implemented in Emacs Lisp (Inder and Kato, 1998; Inder, Hurst and Kato, 1998), and a more powerful version is currently being implemented in CAPE using DIME.
Apart from DIME, CAPE has been used to build a system for generating graphical maps of web sites. Given an initial URL and a regular expression to delimit the area to be mapped, the system fetches the relevant pages and analyses them to extract the links that they contain. The eventual aim is to generate a graph and output it in the format required by a graph layout system--specifically, the one found in the ``thinking support tool'' D-Abductor (Sugiyama and Misue, 1991).
However, although one can trivially generate a graph description from a web site, such approaches scale very badly: even modest web sites give rise to graphs that contain too many nodes to be drawn effectively. Producing workable graphs, therefore, involves either generating or recognising structures that can be used as a basis for omitting or temporarily hiding parts of the graph.
The mapping system uses Perl for low-level tasks, such as fetching pages and extracting the links they contain by regular expression matching.
However, the top level of the system is written in CLIPS. In particular, the system's ontology is described by defining CLIPS objects for such concepts as pages, page sets, servers and directories. CLIPS rules and methods are then used for mapping pages into suitable graph descriptions. The current prototype contains just over 40 rules, doing such things as deciding which pages should be fetched and detecting ``chains'' of similar links.
This approach is taken to ensure that, as far as possible, rules can be written at the domain level, without reference to the mechanics of fetching or searching documents. The figures give some idea of the extent to which this has been achieved. In particular, the rule for detecting ``chains'' of similar links can be written entirely in terms of facts describing the links that have been found in the various documents. These facts can be generated in CAPE in a single line of code, just by requesting the matches for an appropriate regular expression. They could, of course, be generated in CLIPS, but doing so would require considerably more error-prone and tedious programming.
Note that the left-hand side of the rule decide-to-fetch-page includes a call to match a Perl regular expression, while spot-chain calls a short Perl function (strip_label) to manipulate a string.
  (defrule decide-to-fetch-page
    "If we know about a page that is relevant to our target, note that we should fetch it"
    (target ?query ?regexp)
    (object (is-a page)
            (name ?doc)
            (type html|unknown)
            (did 0)
            (url ?url&:(p-match-p ?url ?regexp)))
    =>
    (assert (to-fetch ?query 1 ?url ?doc)))

  (defrule spot-chain
    "Find a chain of 'identical' links"
    (link ?did ?source ?anchor ?dest $?rest)
    (link ? ?dest ?anchor ?dest2 $?)
    (link ? ?dest2 ?anchor ?dest3 $?)
    (not (link ? ? ?anchor ?source $?))
    (not (object (is-a chain-set) (source ?source) (anchor ?anchor)))
    (object (is-a webitem)
            (name ?obj)
            (url =(perl-call-s strip_label ?source)))
    =>
    (make-instance (gensym "chain-") of chain-set
      (status possible)
      (anchor ?anchor)
      (source-obj ?obj)
      (source ?source)))
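The class definitions themselves are not shown here. Purely as an indication of what the object patterns above are matching against, the following is a minimal reconstruction based on the slots the rules mention; the facets, defaults and any further slots (and the webitem class, which is treated analogously) are guesses.

  ;; Reconstructed sketch only: slots are taken from the rules above;
  ;; facets, defaults and any additional slots are guesses.
  (defclass page (is-a USER)
    (role concrete) (pattern-match reactive)
    (slot url)                     ; address of the page
    (slot type (default unknown))  ; e.g. html or unknown
    (slot did (default 0)))        ; document id; 0 until the page is fetched

  (defclass chain-set (is-a USER)
    (role concrete) (pattern-match reactive)
    (slot status)       ; e.g. possible
    (slot anchor)       ; anchor text shared by the links in the chain
    (slot source)       ; labelled URL the chain starts from
    (slot source-obj))  ; the corresponding webitem instance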
CAPE is a new tool which brings together two very different programming systems. As such, there are inevitably a number of minor inconsistencies and omissions that reveal the immaturity of the system. Nevertheless, the functionality of the current version of the system has been stable since the spring of 1998, and, as described above, the system itself has been used for the development of substantial research systems since the following summer.
The system was made available by FTP and announced on the Web in February 1999, and has been downloaded by about one person per day since then.
The core CAPE environment comprises 3000 lines of C code, plus 600 lines of Perl or CLIPS. The distribution currently also includes three demonstration applications (a web-server that can be configured to add links to pages, a demonstration of bi-directional inter-process communication, and a ``dungeon'' game), and 1500 lines of CAPE code in two standard CAPE ``component applications'', or ``Capplets'' (support for regression testing CAPE applications and a simple Web server), with more being developed.
There is also a thirty-one page manual.
CAPE runs under a number of versions of UNIX, including Solaris and Linux.
The main shortcoming of CAPE is the fact that CLIPS and Perl have different syntaxes, CLIPS being Lisp-like and Perl being generally C-like. This unfortunately means that fully exploiting CAPE requires reasonable proficiency in both languages, and the ability to work with more than one language at a time (when programming, it is easy to write fragments of the ``wrong'' language). A single ``unified'' syntax is therefore clearly desirable, although defining it is non-trivial, since Perl syntax is already complex and very ``dense'' (in the sense that a great many characters and character combinations are already assigned meanings).
There are a number of obvious extensions to CAPE, some of which will be added to the system in the near future:
CAPE is a powerful tool for building a new generation of KBSs. It provides powerful mechanisms to support a number of key activities:
CAPE is a powerful tool for building KBSs that can exploit the opportunities offered by the Internet. Download it now!
CAPE was developed while the author was supported by a NEDO Visiting Researcher fellowship at the Electrotechnical Laboratory (ETL) in Tsukuba, Japan. Matt Hurst helped greatly with bug hunting in the early stages, and Srinivasan (1997) was invaluable.