Please send any comments to Jinyong Lee.
Template Construction Grammar (TCG) is both a conceptual model of the language system and a computational model of speech production. It runs on input from the vision system, provided as SemRep, a graph-like representation that bridges the vision and language systems. An early version of the computational model has recently been implemented.
The overall aim of this approach is to provide clues about how linguistic processes relate to the mechanisms that integrate the perception of visual scenes with the production of utterances. As an initial effort, a preliminary sketch of SemRep (Semantic Representation) and Template Construction Grammar (TCG) has been proposed (Arbib & Lee, 2007, 2008). A series of eye-tracking experiments was designed and conducted in the hope of exploring possible explanations for the link between visual perception and utterance production. The Extension of TCG section describes how TCG can be used to explain the claim suggested by the experimental results.
SemRep (Semantic Representation)
SemRep (Semantic Representation) is a hierarchical graph-like representation of conceptual schemas, posited as a bridge between the language and vision systems. SemRep is generally in registration with a mental image or a scene (as a 2.5-D sketch) and may be extended to an episodic description that spans a certain time period. SemRep also needs to be compact enough to be easily transformed into a verbal expression. SemRep is proposed to be anchored to an image as an abstraction from the schema assemblages generated from that image, with the crucial addition of actions and events extended in time. Because the representation is abstract, SemRep may be viewed as an interpretation of the scene rather than a snapshot, since it is unlikely to capture all the subtle details of the objects and events present in the scene. SemRep encodes only the cognitively important events or objects that are estimated to be relevant to current interests or conditions. Even for the same scene, therefore, SemRep may yield a different graph at each moment, capturing different aspects of the scene according to the given goals and biases (e.g. for the event described in the figure below, one viewer might focus on the woman's hitting the man whereas another focuses on her prettiness and the gaudy color of her dress).
In a similar manner, the structure of SemRep does not have to follow the actual changes of an event of interest, but may focus on conceptually significant changes. For example, an event describable by the sentence “Jack kicks a ball into the net” actually covers several time periods: Jack’s foot swings ⇒ Jack’s foot hits a ball ⇒ the ball flies ⇒ the ball gets into the net. Note that Jack’s foot swings and Jack’s foot hits a ball are combined into Jack kicks a ball, and the ball flies is omitted. This taps into a schema network, which can use stored knowledge to unpack items of SemRep when necessary. The same principle applies to the topology of SemRep entities. The arrangement of conceptual entities and their connections may or may not follow that of the actual images and objects. A description such as “a man without an arm”, for example, does not exactly match the actual object setting, since it encodes the conceptual entity of an arm that is missing from the actual image. Here one may need to include what is not in the image in order to block standard inferences where they are inappropriate. This is akin to the notion of inheritance in semantic networks.
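The idea that one scene can yield different SemRep graphs under different goals can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SemRep` class, the `interpret` method, the relevance scores and the 0.5 cutoff are hypothetical and are not taken from the TCG implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SemRep:
    """Hypothetical minimal SemRep: concept nodes plus labeled, directed edges."""
    nodes: dict = field(default_factory=dict)  # node id -> concept label
    edges: list = field(default_factory=list)  # (source id, relation, target id)

    def interpret(self, relevance, cutoff=0.5):
        """Return the sub-graph whose concepts are relevant enough to the current goal.

        `relevance` maps a concept label to an assumed score in [0, 1].
        """
        kept = {n: c for n, c in self.nodes.items()
                if relevance.get(c, 0.0) >= cutoff}
        kept_edges = [(s, r, t) for (s, r, t) in self.edges
                      if s in kept and t in kept]
        return SemRep(nodes=kept, edges=kept_edges)


# One scene: a woman in a dress hits a man.
scene = SemRep(
    nodes={"n1": "WOMAN", "n2": "HIT", "n3": "MAN", "n4": "DRESS"},
    edges=[("n2", "agent", "n1"), ("n2", "patient", "n3"), ("n1", "wears", "n4")],
)

# Two assumed goal-dependent relevance profiles for the same scene.
action_goal = {"WOMAN": 0.9, "HIT": 1.0, "MAN": 0.8, "DRESS": 0.1}
appearance_goal = {"WOMAN": 0.9, "DRESS": 0.8, "HIT": 0.2, "MAN": 0.1}

action_view = scene.interpret(action_goal)
appearance_view = scene.interpret(appearance_goal)

print(sorted(action_view.nodes.values()))      # concepts kept under the action goal
print(sorted(appearance_view.nodes.values()))  # concepts kept under the appearance goal
```

The two views share the same underlying scene but keep different nodes and edges, which is the sense in which SemRep is an interpretation rather than a snapshot.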
Template Construction Grammar (TCG)
In TCG, constructions are pairings of meaning and form, as they are generally defined in other construction grammar approaches. Each construction contains a partial SemRep graph as its meaning part and a series of phonetic notations as its form part. The form-meaning pair of each construction acts as a template that supplies the matching constraints during production or comprehension. The application of the form-meaning pair is bi-directional: the same set of constructions is used for both production and comprehension. In the production procedure, the meaning part is superimposed on the SemRep graph that is to be translated and thereby serves as a template for selecting suitable constructions. This template carries both the semantic and the topological constraints on the graph.
Since the Construction Grammar formalism, which encapsulates form and meaning, provides an ideal foundation for the application of schema theory, constructions are treated as schemas in TCG. Construction schemas are instantiated on the input to the system: SemRep graphs for the production procedure and verbal expressions for the comprehension procedure. The results (verbal expressions for the production procedure and SemRep graphs for the comprehension procedure) are generated through interactions among construction instances based on the competition and cooperation paradigm.
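As a rough illustration of how a meaning part might be superimposed on a SemRep graph during production, the sketch below tries every assignment of SemRep nodes to the slots of one construction and keeps those that satisfy both kinds of constraint. The `IS_A` hierarchy, the dictionary layout of the `svo` construction and the `match` function are hypothetical; the form part is reduced to a slot order rather than phonetic notations.

```python
from itertools import permutations

# SemRep as plain data: node id -> concept, plus (source, relation, target) edges.
semrep_nodes = {"n1": "WOMAN", "n2": "HIT", "n3": "MAN"}
semrep_edges = {("n2", "agent", "n1"), ("n2", "patient", "n3")}

# Assumed concept hierarchy used to check the semantic constraints.
IS_A = {"WOMAN": "HUMAN", "MAN": "HUMAN", "HIT": "ACTION"}


def is_a(concept, category):
    """True if `concept` equals `category` or inherits from it via IS_A."""
    while concept is not None:
        if concept == category:
            return True
        concept = IS_A.get(concept)
    return False


# Hypothetical construction: the meaning part is a small graph template,
# the form part is simplified to the order in which slots are verbalized.
svo = {
    "name": "SVO",
    "nodes": {"subj": "HUMAN", "verb": "ACTION", "obj": "HUMAN"},
    "edges": {("verb", "agent", "subj"), ("verb", "patient", "obj")},
    "form": ["subj", "verb", "obj"],
}


def match(construction, nodes, edges):
    """Return every slot-to-node binding that satisfies the template."""
    slots = list(construction["nodes"])
    bindings = []
    for candidate in permutations(nodes, len(slots)):
        binding = dict(zip(slots, candidate))
        # Semantic constraint: each bound node must be of the slot's category.
        semantic_ok = all(
            is_a(nodes[binding[s]], construction["nodes"][s]) for s in slots
        )
        # Topological constraint: each template edge must exist in the SemRep.
        topological_ok = all(
            (binding[s], r, binding[t]) in edges
            for (s, r, t) in construction["edges"]
        )
        if semantic_ok and topological_ok:
            bindings.append(binding)
    return bindings


found = match(svo, semrep_nodes, semrep_edges)
for binding in found:
    print([semrep_nodes[binding[slot]] for slot in svo["form"]])
```

Brute-force enumeration is used only because the example graph is tiny; it says nothing about how the actual system searches for matches.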
The production procedure in TCG consists of three main phases: update, cooperation, and competition. The core of the TCG process lies in the competition and cooperation among construction schema instances, which are carried out through these phases. During the update phase, the internal state of the system is updated: the SemRep graph in working memory, the verbal expression in the phonological buffer, and the currently activated construction instances. The system then invokes new construction instances based on the input it is given. For production, the meaning part of each construction in the repertoire is matched against the current SemRep; for comprehension, the verbal expression, in the form of phonetic notations, is matched against the form part of each construction in the repertoire. Newly invoked construction instances can be connected to previously invoked ones, forming a group of instances (a hypothesis) where this is possible (grammatically for the production procedure, semantically for the comprehension procedure). Construction instances in the same group strengthen each other’s appropriateness (cooperation among schemas), so that larger groups eventually have a better chance of being chosen for the solution. Construction instances, or groups of instances, also compete with each other if they are in conflict (competition among schemas). During invocation, each construction is attached to a certain area of the input, e.g. a subgraph of SemRep for the production process, and the construction is said to cover this area. The computational goal of the system is to cover the whole area of the input with all possible combinations of construction instances. A conflict arises when a construction tries to cover a region that is already covered by other constructions.
When there is a conflict, the system assesses the appropriateness of the conflicting instances in terms of the match score of each construction instance, which is calculated from factors such as the semantic closeness of the instance, task requirements, or semantic/syntactic constraints. As noted above, an instance that is connected to other instances is assessed together with the whole group it belongs to. The score is, however, degraded along the connection hierarchy within the group, which biases the system toward scoring simpler groups higher. Construction instances, or whole groups, that lose the competition are eliminated from the system. The procedure cycle iterates until an ‘equilibrium state’ has been reached, where no more meaningful change is detected in the construction instances and the internal SemRep.
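The competition step described above can be sketched as follows. This is a toy under stated assumptions, not the scoring used by TCG: the `Instance` fields, the geometric `DECAY` factor standing in for degradation along the connection hierarchy, and the pairwise elimination loop are all hypothetical choices.

```python
from dataclasses import dataclass

DECAY = 0.8  # assumed degradation applied per level of the connection hierarchy


@dataclass(frozen=True)
class Instance:
    name: str
    covers: frozenset  # SemRep node ids covered by this construction instance
    score: float       # assumed match score of the instance on its own
    depth: int = 0     # level of the instance within its group's hierarchy


def group_score(group):
    """Cooperation: members add up, but deeper members count for less."""
    return sum(inst.score * DECAY ** inst.depth for inst in group)


def coverage(group):
    """Union of the SemRep regions covered by all members of a group."""
    return frozenset().union(*(inst.covers for inst in group))


def compete(groups):
    """Competition: while two groups cover a common region, drop the weaker one."""
    survivors = list(groups)
    while True:
        conflict = next(
            ((a, b) for i, a in enumerate(survivors) for b in survivors[i + 1:]
             if coverage(a) & coverage(b)),
            None,
        )
        if conflict is None:
            return survivors  # no conflicts left: a simple equilibrium
        a, b = conflict
        loser = a if group_score(a) < group_score(b) else b
        survivors = [g for g in survivors if g is not loser]


sentence = [
    Instance("SVO", frozenset({"n1", "n2", "n3"}), 0.9, depth=0),
    Instance("WOMAN", frozenset({"n1"}), 0.8, depth=1),
    Instance("MAN", frozenset({"n3"}), 0.8, depth=1),
]
fragment = [Instance("HIT_NOUN", frozenset({"n2"}), 0.6)]
unrelated = [Instance("DRESS", frozenset({"n4"}), 0.5)]

winners = compete([sentence, fragment, unrelated])
print([[inst.name for inst in group] for group in winners])
```

Here the sentence group and the fragment both claim node `n2`, so they compete and the weaker one is eliminated, while the group covering only `n4` is untouched because it conflicts with nothing.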
Eye-tracking Experiments
Two experiments were designed and conducted to test hypotheses on how SemRep is built from perceived visual information and how it influences the choice of constructions for the utterances produced. Complex, natural scenes were selected as visual stimuli in order to explore various explanations and set up hypotheses that are not easily discerned from the controlled visual stimuli typical of this type of experiment.
The results suggest four observations.
- Observation 1: One obvious gaze pattern was: multiple brief fixations on many items ⇒ revisit each item at utterance. This is called the macro-to-micro pattern.
- Observation 2: Another gaze pattern was nearly the opposite of the one above: fixate an item ⇒ generate an utterance while moving on to the next item. This is called the micro-shift pattern.
- Observation 3: A threshold value that decides the well-formedness of sentences for utterance was observed: a low threshold for sentence fragments and a high threshold for more complete sentences.
- Observation 4: For relatively similar semantics, the verbal descriptions given by different subjects can differ significantly, depending on the organization of ‘discourse units’, which roughly correspond to the clauses or complex phrases of an utterance.
These observations lead to the hypothesis that the macro-to-micro pattern and the micro-shift pattern are not manifestations of two distinct mechanisms, but rather two extreme cases of a single mechanism operating under different policies. A speaker may exhibit the macro-to-micro pattern when the threshold is higher (e.g. for a complex scene) and the micro-shift pattern when the threshold is lower (e.g. for a straightforward scene). Simply put, changing the parameters involved in policy selection controls the gaze and utterance pattern that is produced.
This hypothesis led to the design of the second experiment, in which natural scenes were carefully manipulated in the hope of eliciting different patterns of eye movement and utterance according to the configuration of each scene.
Two types of scenes were used in the experiment. The first type (Type 1) consisted of natural scenes in which more than one event was happening (the main event at the center and other sub-events in the background). Each scene was manipulated in its background (BG) so that the main event was emphasized while the background was more or less blurred. The scenes were paired so that either the normal-BG version or the blurred-BG version was presented to each subject. The second type (Type 2) consisted of natural scenes of a single action shown from different perspectives: a direct perspective and a side perspective. The actions were chosen so that they could be described either by sentences with Conjoined NP clauses (e.g. man and woman are shaking hands) or by SVO symmetrical sentences (e.g. man (or woman) is shaking hands with woman (or man)). In the direct perspective scenes, both agents were clearly shown with roughly the same importance, whereas in the side perspective scenes only one of the agents was clearly shown and the other (especially the face) was partially occluded.
Three aspects of the results were examined and analyzed: the overall fixation distribution, the speech pattern, and the initial (~2 sec.) fixation pattern. The analysis suggests that the normal-BG/direct perspective case induces an eye-gaze and utterance pattern that is more similar to the macro-to-micro pattern. The online task seems to influence the eye-gaze and utterance pattern in some cases, but the effect was not significant. The general conclusion is that the two seemingly opposing patterns may result from different scene and task settings. It therefore appears possible to induce various gaze-utterance patterns by manipulating experimental factors, which supports the hypothesis given above.
Extension of TCG
Since scene interpretation happens dynamically, SemRep is proposed to be a continuously changing medium between the vision and language systems, which run in parallel. Such a dynamic system would produce a variety of utterance styles even for a single scene. Combined with an attention deployment mechanism and working memory (WM), the current version of the TCG system can be extended to take this aspect into account. There are four crucial properties that are required for the extended version.
- I. SemRep is stored in WM with a certain memory capacity, which limits the number of SemRep elements that can be processed at the same time.
- II. Attention is required not only for updating SemRep (i.e. interpreting the scene), but also for applying constructions.
- III. A threshold forces the system to produce an utterance once it is reached.
Combinations of the parameters of the extended system can result in two extreme cases of utterance style.
A: If the threshold is high and attention shifts relatively frequently, then only a small amount of semantic information is available, so that only abstract constructions can be invoked first (macro-to-micro pattern).
B: If the threshold is low and attention does not shift frequently, then relatively detailed semantic information is available for invoking lexical constructions early (micro-shift pattern).
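The two extreme cases can be mimicked with a toy simulation in which a single loop produces both patterns, depending only on its parameters. This is a sketch under strong assumptions and not the extended TCG system: the `simulate` function, the `dwell` parameter (time steps spent on an item before attention shifts) and the rule that one unit of information is gathered per step are all hypothetical, and the WM capacity limit of property I is left out.

```python
def simulate(items, dwell, threshold, steps):
    """Return a log of ("look", item) and ("say", items) events.

    dwell:     assumed number of time steps attention stays on one item
    threshold: assumed amount of gathered information that forces an utterance
    """
    detail = {item: 0 for item in items}  # information held per item
    pending = 0                           # information gathered since last utterance
    log = []
    for t in range(steps):
        item = items[(t // dwell) % len(items)]
        if t % dwell == 0:
            log.append(("look", item))    # attention shifts to a new item
        detail[item] += 1
        pending += 1
        if pending >= threshold:
            # The utterance covers every item seen since the last utterance.
            described = tuple(i for i in items if detail[i] > 0)
            log.append(("say", described))
            detail = {i: 0 for i in items}
            pending = 0
    return log


items = ["man", "woman", "dog"]

# Case A: high threshold, frequent shifts -> scan everything, then speak.
macro_to_micro = simulate(items, dwell=1, threshold=6, steps=6)

# Case B: low threshold, infrequent shifts -> speak about each item in turn.
micro_shift = simulate(items, dwell=3, threshold=3, steps=9)

print(macro_to_micro)
print(micro_shift)
```

In case A every item is fixated briefly, and more than once, before a single utterance covering all of them is produced. In case B each fixation is followed by an utterance about that item alone. Only the parameter values differ between the two runs, which is the point of the single-mechanism hypothesis.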
This research is supported in part by the National Science Foundation under Grant No. 0924674 (M.A. Arbib, Principal Investigator), and in part by a Research Grant from the Okawa Foundation.