Recommended citation: Sebastián Bonilla. The Semantic Web and Metarepresentational Agents Based on Discourse Markers [online]. "Hipertext.net", num. 5, 2007. <http://www.hipertext.net>
1. Semantic Web, Ontologies, Metadata and Intelligent Agents
2. Semantics and Meaning: Some problems with computer implementation
3. Meta-representation and Discourse Markers. A proposal for the Semantic Web
As any Internet user has probably seen in their own daily experience, the current Web is made up of a huge amount of scarcely structured and poorly defined resources. The price we pay for this is the emergence of a disproportionate amount of irrelevant information.
One of the formal, infrastructural causes of this situation is that the current Web is based on HTML. The growing dissatisfaction with this user-level markup language, which allows hypertext, image, audio and multimedia encoding, can be summed up in the widespread opinion among specialists that HTML is simply a structural language for surface editing.
In 1999, Tim Berners-Lee, the creator of the World Wide Web, first wrote about the hypothetical requisites that a future Semantic Web must fulfil in order to facilitate "the introduction of meaning, of intelligence, in the Internet" (W3C, 1999). In the recently generated academic discourse about the Internet, this position has been taken up again and is starting to create a sort of consensus around the idea that the future of the Web lies in research on human-like, artificial forms of qualitative intelligence (Berners-Lee, 2001).
The first key step towards the Semantic Web we will one day see is the generalised use of the XML language (Cover, 1998), which has already been included in web page editing programmes for several years now. This language adds to HTML the possibility of including an infrastructure of metadata at the code level, providing it with ontological descriptions of the information stored in the resource.
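To make the contrast concrete, here is a minimal sketch (in Python, with invented names and example facts used purely for illustration) of the difference between presentation-only HTML and the kind of subject-predicate-object metadata that an RDF-style annotation layer adds to a resource:

```python
# Illustrative sketch only: the same fact expressed first as surface HTML,
# then as machine-readable subject-predicate-object triples of the kind an
# RDF metadata layer provides. All identifiers here are invented examples.

html_fragment = "<p><b>Hipertext.net</b> is an academic journal.</p>"

# The triples make the relationships themselves explicit to a machine.
triples = [
    ("Hipertext.net", "rdf:type", "AcademicJournal"),
    ("Hipertext.net", "dc:language", "en"),
]

# A program can now answer a structural question that string matching on
# the HTML fragment cannot: "what kind of thing is this resource?"
kinds = [o for (s, p, o) in triples
         if s == "Hipertext.net" and p == "rdf:type"]
print(kinds)  # ['AcademicJournal']
```

The point of the sketch is that the triple form carries the relationship itself, not just the words, which is exactly the information a keyword engine lacks.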
In the context of this new take on a website that semantically encodes the meaning of the information, it would be possible to design search bots, content managers and autonomous agents that "understand" the documents and carry out "intelligent" searches, extraction and information processing of the information relevant to the user. According to Berners-Lee, "We are not talking about a magical artificial intelligence that allows computers to understand the user's words, but just the computer's capacity to resolve well-defined problems through well-defined operations on well-defined data" (W3C, 1999).
The Semantic Web hopes to substantially improve the interaction between computer systems and human beings by providing the former with greater intelligence and autonomy, and the latter with a new conceptual info-universe intellectually-ergonomic satisfying the need to convert information into knowledge (for more on this idea see Codina and Rovira, 2006).
In order to understand the change the Semantic Web would bring about, consider how it would resolve one of the fundamental problems with current Internet search engines. At present, these engines base their searches exclusively on key words in a page's code. Future semantic searches, however, will focus on ontological information. An ontology is an explicit formal definition, structured in taxonomies, of a body of knowledge (Gruber, 1993). The hypothesis underlying this new research strategy is that an ontology designed to be automatically processed would, in theory, allow a computer to simulate information processing in the same way humans understand language. It is true that a machine cannot understand the information it processes in any deep sense of understanding. Nevertheless, Artificial Intelligence could take a step closer to human intelligence if the data processed were stipulated semantically through ontologies. In an ontologically structured Semantic Web, Internet search engines would stop returning millions of indiscriminate, usually irrelevant results and would instead offer something closer to the qualitative information that a human specialist in the field might have selected.
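A toy example may help. The following hedged sketch, built around an invented three-entry taxonomy, shows the kind of query an ontology-aware engine can answer and a purely keyword-based one cannot: matching a document whose topic is a *subclass* of the concept asked about, even when the query word never appears on the page.

```python
# Minimal sketch of taxonomy-backed search. The ontology, documents and
# topic labels below are all invented for illustration.

ontology = {  # child -> parent ("is-a" relations)
    "jaguar_car": "automobile",
    "jaguar_animal": "feline",
    "feline": "mammal",
}

def is_a(concept, ancestor):
    """Walk the taxonomy upward to test whether concept falls under ancestor."""
    while concept in ontology:
        concept = ontology[concept]
        if concept == ancestor:
            return True
    return False

documents = {
    "doc1": "jaguar_animal",   # a page about the big cat
    "doc2": "jaguar_car",      # a page about the car brand
}

# Query: pages about mammals. A keyword search for "mammal" finds neither
# page; the ontological search finds doc1 only, by following is-a links.
hits = [d for d, topic in documents.items()
        if topic == "mammal" or is_a(topic, "mammal")]
print(hits)  # ['doc1']
```

The disambiguation of the two senses of "jaguar" is precisely the kind of qualitative selection the paragraph above attributes to a human specialist.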
The OntoTag project (Aguado et alii, 2002) follows this line of thinking. It is a model of semantic text annotation that (a) follows the EAGLES recommendations (1999) set by the European Union in its "Corpus Encoding Standard" initiative, (b) applies the abstract descriptive principles of Ontological Semantics (Nirenburg & Raskin, 2001) (for the current state of this field, see OntoWeb, 2002), and (c) introduces the latest RDF(S)/XML mark-up languages.
OntoTag, like other similar research projects (Benjamins et alii, 1999), aims to annotate web documents semantically so as to help computers understand textual information, within the larger project of creating the Semantic Web. OntoTag's uniqueness is that it applies the methodology of Corpus Linguistics: a "corpus", as Leech (1997) explains, is a "set of electronic linguistic material that can be digitally processed for different aims, like linguistic research or language engineering." The project applies both of McEnery & Wilson's (2001) semantic annotation modes: (a) the semantic relationships between textual elements (agents, patients and other participants), and (b) the semantic characteristics of each one of the words that make up a text.
However, those responsible for the OntoTag model acknowledge that "there is no universal agreement in the semantic environment on which characteristics of words should be annotated" and that "we must still exhaustively determine the set of basic semantic-cognitive categories." They also identify as the project's key drawback the limitations imposed by the current state of technology: "the process of automatically obtaining compact, legible and verifiable pages is a very difficult delimiting and specification task; moreover, including semantic annotations in a web document would also increase document downloading time (though not too much)."
To the drawbacks mentioned by the OntoTag directors themselves we could add two more critical reflections. The first is the danger of obsolescence that corpus annotation, mark-up and tagging technology suffers from the continuous change in standards. The second is the task of semantically annotating the complete Web, turning it into an ontologically marked and tagged corpus, which due to its size and constant change appears to be an unachievable endeavour.
Leaving aside the current difficulties posed by the creation of an ontologically enriched Web, the next step towards completing the Semantic Web would be to make feasible the design of digital products such as intelligent agents that manage human users' needs executively, autonomously and adaptively.
The ideal intelligent agent (Hendler, 1999), a true challenge for Computer Science researchers, would first of all be communicative: it would interact fluidly with the user about his or her personal objectives and preferences. Secondly, it would be executive: capable of making decisions, not merely presenting the user with various options to choose from. Thirdly, it would be autonomous: capable of acting without the user having to control it completely and continuously. Finally, it would be adaptive: capable of learning from its own operational experience and from the user's idiosyncratic preferences.
In other words, intelligent agents would take charge of the routine, and at times physically impossible, work that is currently done manually by users browsing and interacting with the Web. Hendler (1999) shows the specialists' optimism on the matter: "Simply stated, if you are not currently using technology based on intelligent agents, don't worry, because you soon will."
The Semantic Web draws up a new ideal(ised) technological environment in which information, ontologically organised, enriched by metadata and run by intelligent, communicative, executive, autonomous and adaptive agents becomes knowledge.
In designing programmes for representing and retrieving information, Web-focused computer scientists are squeezing every last drop out of current linguistic models based on formal rules of morphology and syntax. The most recent tendency in research, which hopes to give the field a qualitative boost, revolves around Semantics.
From the perspective of recent developments in linguistics, the semantic approach applied in the Web's interdisciplinary field is possibly creating some excessively enthusiastic expectations, undoubtedly boosted by the power of the words "semantic" and "meaning" to fascinate, along with the use of "Semantic Web" not in its strictly technical sense but as a privileged metaphor, a terminological discovery behaving as if it were a meme (Dawkins, 1976).
It is therefore worth briefly and critically reviewing what Semantics is, what constitutes meaning, and to what degree both concepts can be applied to the future development of the Web.
As Leech (1974) and Lyons (1977) noted in their classic linguistics manuals, semantic theory covers a wide spectrum of topics related to meaning and the interpretation and understanding of language; but if both specialists had to agree upon a minimalist definition, it would be, more or less: "Semantics is the study of the multiple relationships of meaning that signs establish with their referents."
In any case, this is the key that possibly dazzled those at the cutting edge of artificial intelligence on the Web: the central idea of semantics (which requires a critical review) is that words encode meaning.
Within the logic on which the Semantic Web project is based, the ontology of a single word can be reconstructed, thus formalising its meaning; once introduced through a computation protocol, this meaning could be programmed into the resource's code and thereby become operative on the Web. From this somewhat optimistic point of view, Artificial Intelligence would have found in ontologies a possibly effective approach to simulating human understanding of language.
From our point of view, the essential problem of this approach stems from the fact that meaning is not in words, as we argue below, but in the mind of whoever processes them. One of this article's basic premises is that meaning is not a representational phenomenon (linguistic-grammatical) but has a metarepresentational nature (cognitive-pragmatic).
For example, dictionaries and encyclopaedias, genuine cultural artefacts that linguistically synthesise a large part of human knowledge, generate the illusion that meaning is in the words: that the final meaning consists of a structured set of words (an ontology) that serve to define other words.
We can thus see how the implementation of the Semantic Web would create an irresolvable self-referential loop (Hofstadter, 1979): words defined by other words, which are in turn defined by other words... If the meaning of a word is in other words, then where is the meaning of those "other words", and so on?
The semantic-computational proposal can be countered with a biological counterargument: a cerebral lesion in a certain neuronal region can prevent a person from accessing the meaning of language. The words would be physically present in the voice that utters them or in the letters printed in books, but the injured brain cannot understand them. Meaning is not in the word; rather, it is generated in a mind with a biological base.
The uniqueness of human cognition is explained by the way the brain works, capable as it is of processing information through trillions of three-dimensional synapses. From a strictly operative perspective, the most optimistic experts in Artificial Intelligence consider that it is in this enormous capacity for complex processing that we find consciousness, intention and understanding. From this point of view, simply by following Moore's law of progressive multiplication of the computational power of synthetic processors, Artificial Intelligence would, as it approached the human capacity to process, leap into self-consciousness, intention and understanding. Was this not the leap that set humanity ahead of the animal kingdom? Why could it not happen in the artificial world?
In opposition to these still-speculative ideas, it is fairly well established that meaning is a mental phenomenon underpinning a human cognitive ability, and that this ability has been genetically acquired by our species through millions of years of evolutionary selection. Therefore, semantics and access to meaning have a biological base tied to the intellectual activity of a living organism.
We cannot forget that the Semantic Web project belongs to the realm of information and computer technology, and cannot be otherwise. That is, its performance requires interaction between minds assisted by networked computers. Meaning is therefore processed in the mind, not in the network; hence there is no current possibility that the Web will develop an autonomous intelligence based on the management of meaning, independent of its human managers and users.
Nevertheless, digitally replicating human understanding of language remains a key priority on the horizon of future research; but this author is quite pessimistic about the idea that the solutions will come from semantics alone (see the metarepresentational suggestions presented in the following section).
In current research, the irresolvable problem (computation's access to understanding) posed by the introduction of the hypothetical Semantic Web could be used to confirm the words of Searle (2001): "the term Artificial Intelligence has always been a mistake. It is better to talk about cognitive simulation." In other words, the almost insurmountable complexity of applying Semantics in computational disciplines makes Artificial Intelligence itself clearly a meme: a privileged metaphorical expression that stimulates researchers' imagination but is far from being an empirical reality. Are semantics and meaning truly what Artificial Intelligence needs to make the qualitative leap towards human intelligence?
Let us think of the computational problem involved in implementing meaning on the Web from a speculative approach rather than a semantic one.
The cultural anthropologist and pragmatist Sperber (2000) proposed the hypothesis that human beings possess a cognitive mechanism, a result of our biological evolution, that allows us to mentally metarepresent linguistic representations. A metarepresentation is a higher-level intellectual representation of the content of a linguistic representation. This metarepresentational ability, a crucial evolutionary advantage, allows our species to live and interact in a world that goes beyond the merely physical into the mental.
An earlier position on this theory, by Wilson and Sperber (1990), suggests the coexistence of two complementary types of information in human language: on the one hand, discourse carries conceptual information (of a semantic nature); on the other, it transports computational information (of a pragmatic and metadiscursive nature), that is, metarepresentational information on how to process the semantic-conceptual information of discourse. In other words, computational instructions for linguistic interpretation (for more, see Wilson, 2000 and Noh, 2000; applied to Spanish, see Escandell, 1998 and Portolés, 2005).
More specifically, Wilson and Sperber analysed the function of specific linguistic elements with computational characteristics that directly influence the way we process and interpret discursive information, paying special attention to discourse markers. First, these work as an ostensive stimulus (a focusing element used purposefully by the speaker to direct the receiver's attention). Secondly, they guide the process of discursive interpretation, helping to decrease the cognitive effort the receiver must invest in understanding meaning.
Interest in the function of these specific elements, and in pragmatic-computational discursive information in general, goes beyond the linguistic specialisation and into other areas of Artificial Intelligence. The following is a good example.
There is a growing demand for computer programmes that simulate human intelligence by creating summaries and that can generate hierarchical conceptual maps of the relevant information contained in a body of knowledge.
The automatic text summarisation programmes available on the market often function by applying word-frequency statistical indexes (which is what computers nowadays do fastest and best). For computers to produce meaningful data at this level of operation, programmes exclude from the count the "empty words": invariable words without full lexical-semantic content (articles, pronouns, conjunctions, prepositions, markers, connectors, etc.), on the supposition (mistaken, we think) that they do not provide interesting enough information about the text. When run, the programme identifies the character strings that are most often repeated as the key words, and thus as the text's key ideas.
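The frequency-based procedure just described can be sketched in a few lines of Python; the stop-word list and sample text below are illustrative stand-ins for a real programme's "empty word" inventory.

```python
# Sketch of frequency-based keyword extraction: count content words only
# (excluding an invented list of "empty words") and report the most
# frequent strings as the text's supposed key ideas.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "of", "in", "is", "and", "that", "it", "to"}

def keywords(text, n=2):
    """Return the n most frequent content words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(n)]

text = ("The ontology describes the domain. The ontology is formal. "
        "A formal ontology structures knowledge.")
print(keywords(text))  # ['ontology', 'formal']
```

Note that the method rewards repetition: a stylistically varied text that says the same thing with synonyms would defeat it, which is exactly the weakness discussed next.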
Even though this appears to be a sensible, common-sense idea, the design forces the programme to operate in some extravagant situations: as we know, well-written texts use many stylistic variations, synonyms and elisions, and not as many repetitions. These linguistic characteristics escape statistical processing and cause the programme to make mistakes in its summaries.
Now, following the metarepresentational train of thought, it would be enough for the automatic summary programme to consider the reformulative or conclusive discourse markers (like "in conclusion" or "therefore"), written into the linguistic surface by the author precisely to facilitate and control the receiver's interpretative trajectory, for its level of efficacy to cross the threshold of acceptability.
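The marker-guided alternative admits an equally small sketch. The marker list below is a tiny illustrative sample, not an exhaustive inventory: instead of counting words, the programme simply selects the sentences the author has explicitly flagged.

```python
# Hedged sketch of marker-guided summarisation: return the sentences that
# contain a conclusive or reformulative discourse marker, i.e. the passages
# the author has flagged as self-summarising. Marker list is illustrative.
import re

MARKERS = ("in conclusion", "therefore", "in other words", "that is,")

def marker_summary(text):
    """Select sentences containing a discourse marker from MARKERS."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if any(m in s.lower() for m in MARKERS)]

text = ("Superstrings vibrate in many dimensions. "
        "The mathematics is intricate. "
        "In conclusion, the theory remains unverified experimentally.")
print(marker_summary(text))
# ['In conclusion, the theory remains unverified experimentally.']
```

Unlike the frequency count, this procedure is indifferent to synonymy and stylistic variation, because it reads the author's own processing instructions rather than surface repetition.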
The idea that human language carries not only conceptual information, but also computational and metarepresentational information, may be very useful in the creation of programmes that simulate intelligence.
The mind processes linguistic information in real time, and so it cannot pause over all of the detailed processes at every level (phonetic, phonological, morphological, syntactic, lexical and semantic) proposed by classic Computational Linguistics in its thorough analyses. The latest theories on human intelligence (Hawkins & Blakeslee, 2004) suggest that the mind does not function through a complex system of formal rules, but through memory patterns and effective processing strategies. Guided by its innate metarepresentational capacity, the mind takes the pragmatic-computational shortcut towards understanding (Bonilla, forthcoming b).
The moment has arrived for us to outline our metarepresentational proposal (cf. Bonilla, 2007). Imagine an application in the future Web. A user launches a research question into the Web, such as "the current state of research in Superstring Theory." What a demanding user would expect from this query, rather than an endless list of web pages whose only common nexus is the word "superstring", is a text with a clear informational structure responding to the question asked: that is, the response one would expect when asking a human specialist in the field.
A metarepresentational agent, not based on ontological-semantic information, would search for information relevant to the question asked, rejecting all resources published more than five years ago. It would then explore the most often cited works and authors in the reference area, in order to establish the classic works on which the material is based, along with the literal citations most often repeated in the most recent bibliography. In this way a synthesised state of the matter would be obtained, pointing towards the answer to the initial research question.
At the same time, the metarepresentational agent would locate the reformulative discourse markers in order to document the fundamental concepts receiving special attention in the selected texts. In this way it would find the areas of maximum relevance, in which the text speaks about itself, makes itself explicit, metarepresents itself. The advantage of this procedure over key-word searches in HTML, or metadata in XML found in the resource's code, is that the concepts appear in a discursive context, which greatly enriches their contribution to solving the problem. In this sense, the key fundamental concepts would be identified and contextualised into a thematic response to the initial research question.
Likewise, the metarepresentational agent would locate the conclusive markers that, in the selected texts, mark out the discursive areas containing the synthesis of the most relevant information, flagged, of course, by the text's author. The advantage this procedure has over summaries written into the resource's code (as in XML prototypes) is again its discursive contextualisation, which would undoubtedly increase the sensation that we are operating with intelligent, linguistically formulated discourse.
With all this material, the metarepresentational agent would construct and send coherent text to the user's screen. This document would have the following structure:
Title: The initial query.
Summary of the state of the topic: what has been said and by whom, establishing the material's foundations. Operative procedure: search for citations in the areas of discursive relevance tagged as "bibliographic references" and their stylistic variants.
Thematic proposal: identification of the key concepts, discursively contextualised. Operative procedure: search for the most relevant reformulative markers in the metadiscursive areas in which the discourse explains itself.
Conclusions: location of the discursive areas which establish the research conclusions. Operative procedure: search for the conclusive markers that mark out the synthesis areas in which the author has metarepresented his or her own text.
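The agent's pipeline, as just outlined, can be sketched speculatively. Every data structure, field name, marker and threshold below is an assumption made for illustration, not a specification of the proposal:

```python
# Speculative sketch of the metarepresentational agent's pipeline: filter
# by date, rank cited works, then extract marker-flagged sentences. All
# names, thresholds and sample data here are invented for illustration.
from collections import Counter

MARKERS_REFORM = ("that is,", "in other words")
MARKERS_CONCL = ("in conclusion", "therefore")

def build_answer(query, resources, current_year=2007):
    # 1. Reject resources published more than five years ago.
    recent = [r for r in resources if current_year - r["year"] <= 5]
    # 2. The most-cited works establish the state of the topic.
    cited = Counter(c for r in recent for c in r["cites"])
    # 3./4. Sentences flagged by reformulative / conclusive markers.
    def flagged(markers):
        return [s for r in recent for s in r["sentences"]
                if any(s.lower().startswith(m) for m in markers)]
    return {
        "title": query,
        "state_of_topic": [w for w, _ in cited.most_common(2)],
        "thematic": flagged(MARKERS_REFORM),
        "conclusions": flagged(MARKERS_CONCL),
    }

resources = [
    {"year": 2006, "cites": ["Green 1984", "Witten 1995"],
     "sentences": ["That is, strings replace point particles.",
                   "In conclusion, a testable prediction is still lacking."]},
    {"year": 1999, "cites": ["Green 1984"], "sentences": []},  # too old
]
answer = build_answer("State of research in Superstring Theory", resources)
print(answer["conclusions"])
```

The result is a single structured document rather than a results list, which is the shape of response the proposal attributes to a human specialist.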
The project of introducing intelligence to the future Semantic Web draws up a new ideal(ised) technological environment in which information, ontologically organised, enriched by metadata and run by intelligent, communicative, executive, autonomous and adaptive agents becomes knowledge.
From our point of view, the essential problem with this position comes from the fact that meaning is not a representational phenomenon (linguistic-grammatical) but a metarepresentational one (cognitive-pragmatic): meaning is a mental phenomenon underpinning a human cognitive ability, genetically passed on within our species through millions of years of Darwinian natural selection. Therefore, semantics and access to meaning have a biological base tied to the intellectual activity of a living organism.
If one approaches the computational problem of introducing meaning (and with it intelligence) into the Web from a perspective other than the semantic one, the idea that human language possesses metarepresentational information may be useful when creating programmes that simulate intelligence.
Therefore, one could design a metarepresentational agent that strategically locates the relevant information, guided by the discourse markers that mark out reformulative and conclusive metadiscursive areas. The advantage of this line of research is that it permits us to operate with contextualised, and hence enriched, information contained in the discourse itself, rather than the decontextualised information found in the resource's code.
This article is part of the HUM2004-03162/FILO project, "Web Semántica y Sistemas de Información Documental", directed by Lluís Codina and financed by the Ministerio de Ciencia y Tecnología (Programa Nacional de Tecnologías de la Información y de las Comunicaciones).
Aguado, G. et alii (2002): "A Semantic Web Page Linguistic Annotation Model. Semantic Web Meets Language Resources," Technical Report WS-02-16, American Association for Artificial Intelligence, California, AAAI Press.
Benjamins, V.R. et alii (1999): "Building Ontologies for the Internet: A Mid Term Report," International Journal of Human Computer Studies 51, 687-712.
Berners-Lee, T. (2001): "The Semantic Web," Scientific American.
Bonilla, S. (2007): "Web Semántica, Marcadores Discursivos y Metarrepresentación," Revista Electrónica de Lingüística Aplicada. http://dialnet.unirioja.es/servlet/extrev?codigo=6978
Bonilla, S. (forthcoming a): "Marcaje de corpus y Metarrepresentación."
Bonilla, S. (forthcoming b): "Metarrepresentación".
Codina, Ll. and C. Rovira (2006): "La Web Semántica," in Tramullas, J. (coord.): Tendencias en documentación digital, Gijón, Trea, 9-54.
Cover, R. (1998): "XML and Semantic Transparency." http://xml.coverpages.org/xmlAndSemantics.html
Dawkins, R. (1976): El gen egoísta, Barcelona, Salvat.
EAGLES (1999): EAGLES LE3-4244: Preliminary Recommendations on Semantic Encoding, Final Report. http://www.ilc.pi.cnr.it/EAGLES/EAGLESLE.PDF
Escandell, V. (1998): "Metapropositions as metarepresentations," Paper delivered to the Relevance Theory Workshop, Luton.
Gruber, T.R. (1993): "A translation Approach to Portable Ontology Specifications." Knowledge Acquisition 5.2, 199-220.
Hawkins, J. & S. Blakeslee (2004): On Intelligence, New York, Times Book.
Hendler, J. (1999): "Is there an Intelligent Agent in Your Future?" http://www.nature.com/webmatters/agents/agents.html
Hofstadter, D. (1979): Gödel, Escher, Bach: Un Eterno y Grácil Bucle, Barcelona, Tusquets.
Leech, G. (1974): Semantics, London, Penguin.
Leech, G. (1997): "Introducing corpus annotation," in Garside R., Leech, G. y McEnery, A. M. (eds.): Corpus Annotation: Linguistic Information from Computer Text Corpora, London, Longman.
Lyons, J. (1977): Semantics, Cambridge, Cambridge University Press.
McEnery, A. M. & Wilson, A. (2001): Corpus Linguistics: An Introduction, Edinburgh, Edinburgh University Press.
Nirenburg, S. & V. Raskin (2001): Ontological Semantics. http://crl.nmsu.edu/Staff.pages/Technical/sergei/book/index-book.html
Noh, E.J. (2000): Metarepresentation: A Relevance-Theoretic Approach, Amsterdam, John Benjamins.
Portolés, J. (2005): "Marcadores del discurso y metarrepresentación," in Casado, M. et alii (eds.): Estudios sobre lo metalingüístico, Berlin, Peter Lang, 25-46.
Searle, J. (2001): Rationality in Action, Massachusetts, MIT Press.
Sperber, D. (2000): "Metarepresentations in an evolutionary perspective", in D. Sperber (ed.): Metarepresentations: A Multidisciplinary Perspective, Oxford, Oxford University Press, 117-137.
W3C (1999): "Semantic Web". http://www.w3c.org/2001/sw
Wilson, D. (2000): "Metarepresentation in linguistic communication," in D. Sperber (ed.): Metarepresentations, Oxford, Oxford University Press. http://www.phon.ucl.ac.uk/home/robyn/workshop/papers/wilson.htm
Wilson, D. & D. Sperber (1990): "Linguistic Form and Relevance," UCL Working Papers in Linguistics 2, 95-112.