Infant Learning Grasping and Affordances (ILGA)

Preliminary draft: please send comments to James Bonaiuto


Reaching and Grasping

Infant Development

Newborn infants aim their arm movements toward fixated objects (von Hofsten, 1982). These early arm movements have been related to the development of object-directed reaching (Bhat et al., 2005), leading to grasping (Bhat & Galloway, 2006), the development of which continues throughout childhood (Kuhtz-Buschbeck et al., 1998). Previous related models of infant motor development include Berthier’s (1996) model of learning to reach and the Infant Learning to Grasp Model (ILGM, Oztop, Bradley, & Arbib, 2004). Common to both of these models is reinforcement-based learning of intrinsically motivated goal-directed actions based on exploratory movements, or motor babbling.

The Infant Learning Grasping and Affordances (ILGA) model simulates the acquisition of object affordances and reach and grasp parameters based on the success of the final state of grasp actions. It builds upon the ILGM model presented by Oztop, Bradley, & Arbib (2004) by modeling a slightly later stage of infant development. Whereas in ILGM grasp execution was open-loop and visual input was minimal, in ILGA the visual input is provided as a surface representation of the target object, and visual feedback is used to bring the fingers into contact with their selected target patches on the object surface. The model uses the idea of virtual fingers as force application units for synergistic control of the fingers during grasping (Arbib et al., 1985). During grasp planning, targets for two virtual fingers are selected based on the visual representation of the object. These targets form an object affordance opposition axis with a given orientation and size. The orientation and size are used to assign actual fingers to each virtual finger and to select values for the wrist rotation and via point. During grasp execution, these values are in turn used to preshape the hand, bring the wrist to the selected offset position and rotation, and guide the virtual fingers to their selected targets. After the grasp has completed, its stability is computed and used as a reward signal to adjust the weights of the connections between the layers.

Grasping in development seems to increasingly involve visual information in preprogramming the grasp (Ashmead, et al., 1993; Lasky, 1977; Lockman et al., 1984; Newell et al., 1993; von Hofsten & Ronnqvist, 1988; Witherington, 2005). Lasky (1977) found that reach and retrieval performance was degraded in infants of 5.5 months and older without visual feedback of the hand. However, note that internal models of reaching are presumably learned so that at a certain point, visual guidance of the hand is no longer needed (Bushnell, 1985). Also note the distinction between information for visually triggered grasping and visually guided grasping (Clifton et al., 1993; McCarty et al., 2001). ILGA involves the learning of affordances for visually triggered actions. Learning internal models is beyond the scope of the current model, so the control of the actual reach and grasp is visually guided. However, the affordance information for visually triggered grasping could also be used by an internal model to generate feedforward grasp commands.

In ILGA, reinforcement learning shapes probability distributions over motor parameter values given the output of the affordance extraction module. ILGM modeled early grasp learning with limited visual processing, so its affordance extraction module output only the presence, position, or orientation of an object in different simulation experiments. In ILGM, only motor parameters specifying the approach to the object were learned. The only parameter related to the hand posture was the enclosure speed, which specified the speed at which all fingers were flexed during the reach. The enclosure was also triggered by the palmar reflex upon object contact (Twitchell, 1970; Streri, 1993). The final grip configuration was entirely determined by the enclosure and was not prespecified in any way. ILGA models the developmental transition to hand preshaping based on visual information (Newell, 1989; Schettino et al., 2003; von Hofsten & Ronnqvist, 1988; Witherington, 2005) and expands the Virtual Finger layer of ILGM to encode the assignment of virtual fingers to targets and the mapping from virtual fingers to real fingers (Lyons, 1985).

All the motor parameters handled by the premotor cortex modules in both ILGM and ILGA are kinematic parameters; dynamics are ignored entirely. This simplification has empirical support: in a study of monkeys grasping objects of various sizes, shapes, and orientations, Mason et al. (2004) found that the hand and reach kinematics were independent of grasp force. This indicates a separation of kinematics and dynamics planning in primate grasping that justifies this aspect of ILGA’s design.

One grasp parameter learned by ILGA is the orientation of the wrist relative to the object. Witherington (2005) found that infants begin at 7 months of age to pre-orient their hands to match an object’s affordances when reaching for that object. Morrongiello & Rocca (1989) showed that by 9 months of age infants are skilled at hand pre-orientation and adjustment, and that their reaches and grasps become more efficient. ILGA thus corresponds to infant development from 7 months onward.

The model consists of several interconnected 2-layer networks which encode either perceptual or motor parameters. In each network, the first layer forms a (noisy) population code representing a probability distribution over the parameter, while the second layer forms a population code over a value of the parameter selected according to this distribution. In this way, the model can learn probability distributions over combinations of successful parameter values rather than fixed combinations. The connection weights between populations are shaped using reinforcement learning. The use of population coding in reinforcement learning has the advantage of reducing the space of parameter combinations that must be searched since at each iteration the values close to the selected value of each parameter are reinforced as well.
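The way population coding spreads reinforcement to neighboring parameter values can be illustrated with a minimal sketch. The Gaussian tuning shape, the reward-modulated Hebbian rule, the learning rate, and all function names below are assumptions made for illustration; the text does not specify the learning rule in this form.

```python
import numpy as np

def population_code(selected, n_units, width=2.0):
    """Gaussian bump of activity centred on the selected unit.

    The Gaussian shape and its width are assumptions for this sketch.
    """
    idx = np.arange(n_units)
    return np.exp(-((idx - selected) ** 2) / (2.0 * width ** 2))

def reinforce(weights, pre, selected, reward, lr=0.1, width=2.0):
    """Reward-modulated Hebbian update (an assumed rule, not the model's own).

    weights  : (n_post, n_pre) connection matrix
    pre      : (n_pre,) presynaptic firing rates
    selected : index of the postsynaptic unit chosen on this trial
    reward   : scalar, positive after a stable grasp, negative otherwise
    """
    post = population_code(selected, weights.shape[0], width)
    # Units near the selected one are active too, so their weights also change.
    return weights + lr * reward * np.outer(post, pre)

# One rewarded trial: presynaptic unit 2 is active, postsynaptic unit 10 was selected.
n_pre, n_post = 5, 21
pre = np.zeros(n_pre)
pre[2] = 1.0
w = reinforce(np.zeros((n_post, n_pre)), pre, selected=10, reward=1.0)
```

After this single trial the weight onto unit 10 changes most, the weights onto units 9 and 11 change somewhat less, and weights from silent presynaptic units are untouched, so one grasp attempt informs a neighborhood of parameter values rather than a single value.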

An interesting parallel between ILGM / ILGA and the Berthier (1996) model of reach learning is the role of random exploration and reinforcement. In both models, movements are randomly generated in response to a target and the mechanisms generating the movements are modified via reinforcement. This process of variability and “opportunistic selection” is thought to introduce new behaviors into the motor repertoire (Bertenthal, 1999; Bertenthal & Clifton, 1998; Siegler, 1994).

Neural Control

Parietal Cortex


The intraparietal sulcus of the macaque opened up (from Geyer et al., 2000).

Many parietal regions have been implicated in various aspects of visuomotor transformations, specifically those in reaching and grasping by way of projections to premotor cortex. Using retrograde tracing, Tanne-Gariepy et al. (2002) showed that these projections are segregated into two parallel pathways with distinct origins and destinations within the parietal and premotor cortices. The largest projections to the dorsal premotor cortex (PMd) arose from superior parietal regions such as the medial intraparietal area (MIP), PEc and PGm (Pandya & Seltzer, 1982), and the parieto-occipital area. The primary inputs to the ventral premotor cortex (PMv) were the anterior intraparietal area (AIP), PEip (intraparietal part of PE, Matelli et al., 1998), anterior inferior parietal gyrus (area 7b), and somatosensory areas.

The MIP, lateral intraparietal area (LIP), and ventral intraparietal area (VIP) all encode the space around the animal in various coordinate frames (Colby & Goldberg, 1999). The LIP has been shown to be responsive to the visual presentation of three-dimensional objects and anatomically links the visual cortex with the AIP (Nakamura et al., 2001). While neurons in the LIP can be selective for shape, the region is not as selective as regions in the ventral stream such as the anterior inferotemporal cortex (AIT, Lehky & Sereno, 2007). The LIP has been implicated in visual target memory (Gnadt & Andersen, 1988) and motor intentions (Snyder et al., 1997; Snyder et al., 2000), and it has been suggested that it represents salient spatial locations (Colby & Goldberg, 1999). We suggest that LIP selects target locations on the surface of the object for virtual fingers and relays this information to AIP for representation of the chosen affordance.


Comparison of monkey (a) and human (b) parietal lobes with putative homologies (from Culham & Kanwisher, 2001).

The caudal part of the intraparietal sulcus (cIPS) is thought to code object features such as shape and orientation (Shikata et al., 2003). Subsets of these neurons have been described as axis-orientation-selective and surface-orientation-selective (Sakata et al., 1998; Taira et al., 2000), meaning that they are tuned to the longitudinal axis of elongated stimuli, or that they are tuned to the orientation of flat stimuli. In an event-related fMRI study, Shikata et al. (2001) found that both cIPS and AIP were activated during a surface orientation discrimination task, but that cIPS was activated to a larger degree. In a follow-up study, Shikata et al. (2003) found that the cIPS was active during surface orientation discrimination, but not during the adjustment of finger positions to match orientation. Sakata et al. (1997) suggest that area cIPS extracts 3D object shape features and projects to AIP. In ILGA, the cIPS encodes the selected object surface opposition axis orientation.

Affordances are defined by Gibson (1966) as opportunities for action that are directly perceivable without recourse to higher-level cognitive functions. It has been widely suggested that area AIP is involved in the extraction of affordances for grasping (Sakata et al., 1998; Fagg & Arbib, 1998; Nakamura et al., 2001). In the Shikata et al. (2003) experiment, AIP was active during both orientation discrimination and finger orientation adjustment. Rice, Tunik, & Grafton (2006) showed that reversible inactivation of AIP in humans using transcranial magnetic stimulation (TMS) disrupts grasping only if applied during execution of the reach, and disrupts reaching to a spatially perturbed object. They suggest that the area dynamically computes a difference vector for use in online grasp feedback control. In monkeys it has been found that neurons in AIP are responsive to 3D features of objects relevant for manipulation, such as shape, size, and orientation (Murata et al., 2000). More recently, Gardner et al. (2007) showed that the activity of these neurons rises during hand preshaping as the object is approached, peaks upon object contact, and declines thereafter. In this model, AIP represents the selected object surface opposition axis for grasping using the target location information from LIP and orientation information from the cIPS. Note that a change in the encoded value in this representation would modify the calculated difference vector in efferent motor structures, and so would yield predictions consistent with the results of Rice, Tunik, & Grafton (2006).

Premotor Cortex

Multiple studies point to a role for the PMv in the visuomotor transformations required for grasping, while it has been suggested that the PMd provides the same function for reaching (Rizzolatti et al., 1998). Fogassi et al. (2001) reversibly inactivated canonical neurons in F5 of PMv with muscimol and showed that this disrupts the preshaping portion of a grasp without affecting the reach. In humans, Davare et al. (2006) used transcranial magnetic stimulation (TMS) to induce a virtual lesion in premotor cortex. Virtual lesions of either left or right PMv impaired the correct positioning of the fingers on the object, while only lesions of left PMv (contralateral to the movement) disturbed muscle recruitment patterns. Lesions to left PMd perturbed the coupling between the grasp and the subsequent lift movement. In another muscimol reversible inactivation study, Schieber (2000) showed that PMv can bias the laterality of motor choices (with respect to both the target and the effector). Note that this does not mean that the selection process is taking place in PMv; if the neural populations of PMv represent the output of a selection mechanism, then inactivation of the portions representing target locations and effectors in one hemisphere would bias the output of the entire system. Kurata & Hoshi (1999, 2002) performed a series of experiments with prism adaptation that implicate PMv in the transformation from visual to motor space. Together, these experiments suggest that, to a first approximation, the ventral premotor cortex represents which body part will be used as the effector and the movement required to bring it to the target.

Several recent neurophysiology experiments have illuminated the types of encoding that the ventral premotor cortex might use. Ochiai et al. (2005) found that when the visual perception of the hand is dissociated from its actual location, PMv neurons are selective for the motion of the hand in visual space (rather than motor space) and for the part of the hand brought into contact with the object. Raos et al. (2006) found that cells in PMv were selective for the hand posture used to grasp an object, and also found cells that coded combinations of grip and wrist orientations. It thus seems that PMv encodes the body part to bring into contact with the object, as well as the direction from which to approach it. We suggest that the grip configuration is encoded in terms of virtual fingers, and that separate populations encode the direction of the reach and the grip / wrist orientation.

The premotor cortex could influence motor activity through projections to the primary motor cortex (M1) or by directly projecting to the spinal cord. In support of the first scheme, Shimazu et al. (2004) found no corticospinal output given stimulation of F5 alone. However, they found that when F5 was stimulated before M1, the corticospinal output normally seen from stimulating M1 was greatly enhanced. Cattaneo et al. (2005) used TMS to show that the increase in the excitability of cortical inputs to the motor cortex before grasping movements is object-specific and is specific to those motor cortex neurons that innervate muscles involved in grasping. F5 thus modulates grasp-related outputs from M1 rather than triggering activity in M1. This underscores the role of ILGA in the parameterization of an action, rather than in action selection.

Virtual Finger (VF) Hypothesis

The virtual finger hypothesis states that grasping involves the assignment of real fingers to so-called virtual fingers, or force applicators (Arbib et al., 1985). For example, in a power grasp, one virtual finger might be the thumb and the other might be the palm. Specification of a virtual finger includes the real finger(s) assigned to it, as well as the portion of the finger to be brought into contact with the object (e.g. a precision pinch uses the finger pad, a side grasp uses the side of the finger). The task of grasping is then to preshape the hand according to the selected virtual fingers and the size of the object (Lyons, 1985), bring the opposition axis of the virtual fingers into alignment with the selected object surface opposition axis, and enclose the hand so that the “virtual fingertips” come into contact with their selected targets on the object surface. In this framework, the object surface opposition axis is the grasp affordance. Experimental evidence consistent with this hypothesis, also known as hierarchical control of prehension synergies, has been found (see Zatsiorsky & Latash, 2004, and Winges & Santello, 2005, for reviews, but see Smeets & Brenner, 2001, for possible contrary evidence).


System Overview


A schematic of the parietal and premotor regions involved in reaching and grasping.

ILGA implements grasp and affordance learning in the framework of virtual fingers and opposition space in a manner consistent with the available experimental data on the parietal and premotor cortices. The model consists of an affordance extraction module that receives visual input concerning the target object and projects to the motor planning module. The output of the motor planning module is used by the motor execution module to parametrize the reach and grasp action. These three modules map roughly onto the parietal, premotor, and primary motor cortices, respectively.

The visual input to the affordance extraction module is a representation of the target object surfaces in terms of a depth-coded, two-and-a-half dimensional image matrix, as well as the three-dimensional center of the object. The object center input is encoded by MIP/VIP and projects directly to the F4 population, while the surface representation projects via modifiable connection weights to two populations of neurons. The LIP population contains neurons that are selective for patches on the surface of the object. The activity of these neurons forms a population code over target locations on the object’s surface for the virtual finger that will serve as the end-effector. The cIPS population contains neurons selective for possible object surface opposition axis orientations. The combination of the selected surface opposition axis and the target location on the object surface for the end-effector virtual finger will determine the target location for the second virtual finger (the second intersection of the object surface with the selected axis).
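The geometric step at the end of this paragraph, finding the second virtual finger target as the second intersection of the object surface with the selected axis, can be sketched for the simplest possible case. The example below assumes a spherical object, for which the intersection has a closed form; the model itself works from a depth-coded surface representation, and the function name and arguments are hypothetical.

```python
import numpy as np

def second_contact_on_sphere(center, first_contact, axis_direction):
    """Second intersection of an opposition axis with a spherical object.

    Assumes (for illustration only) that the object is a sphere and that
    first_contact lies on its surface. Points on the axis are
    first_contact + t * d; substituting into the sphere equation gives
    t = 0 (the first contact) and t = -2 * d . (first_contact - center).
    Returns the second contact point and the axis size (contact distance),
    or (None, 0.0) if the axis is tangent to the surface.
    """
    center = np.asarray(center, dtype=float)
    first_contact = np.asarray(first_contact, dtype=float)
    d = np.asarray(axis_direction, dtype=float)
    d = d / np.linalg.norm(d)
    t = -2.0 * np.dot(d, first_contact - center)
    if np.isclose(t, 0.0):
        return None, 0.0
    return first_contact + t * d, abs(t)

# Unit sphere at the origin, first target on the +x side, axis along x.
second, size = second_contact_on_sphere([0, 0, 0], [1, 0, 0], [1, 0, 0])
```

In this example the second target falls on the opposite side of the sphere and the axis size equals the diameter. The result does not depend on the sign of the axis direction, which matches the idea that an opposition axis has an orientation but no preferred sense.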

The LIP population projects via connections with modifiable weights to the F4 population. This population encodes the direction of the reach in terms of a wrist via point. Each neuron in this population is selective for a position in a spherical coordinate frame centered on the object center.
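As a small illustration of an object-centered spherical coordinate frame, the sketch below converts a via point given as (radius, azimuth, elevation) relative to the object center into Cartesian workspace coordinates. The angle conventions (azimuth in the x-y plane measured from the x axis, elevation measured from that plane toward z) and the function name are assumptions; the text does not state which convention the F4 population uses.

```python
import numpy as np

def via_point_to_cartesian(object_center, radius, azimuth, elevation):
    """Convert an object-centred spherical via point to Cartesian coordinates.

    Assumed convention: azimuth is the angle in the x-y plane from the
    x axis, elevation is the angle above the x-y plane, both in radians.
    """
    offset = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    return np.asarray(object_center, dtype=float) + offset

# A via point 0.5 units directly above an object centred at (1, 2, 3).
above = via_point_to_cartesian([1.0, 2.0, 3.0], 0.5, 0.0, np.pi / 2)
```

Because the frame is centered on the object, the same selected via point describes the same approach direction wherever the object sits in the workspace.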

Both the LIP and cIPS populations project via fixed connections to the AIP population, whose neurons are selective for particular surface opposition axes. Each surface opposition axis is thus defined by an orientation in 3D space (provided by cIPS) and a surface intercept (provided by LIP).

The AIP and F4 populations both project via modifiable weights to the F5 population. This population’s neurons are selective for combinations of virtual finger configurations and wrist orientations. Virtual finger 1 can be the index finger, the index and middle fingers, or all four fingers. The virtual fingertip can be the pad of the finger, the palm side of any knuckle, or the palm. Virtual finger 2 is always the thumb and the virtual fingertip can be the palm side of any knuckle.

The outputs of the F4 and F5 populations are used by the motor execution module to execute the reach and grasp movement and to bring the virtual fingers to their selected targets on the object’s surface. Once the movement has completed, the stability of the grasp is estimated and used to generate a reward signal to adapt the modifiable inter-population connection weights. The reward signal is generated such that combinations of affordances and motor parameters that led to stable grasps are positively reinforced, while those that led to unstable grasps (defined by the object having a nonzero net torque) are negatively reinforced. The model is thus intended to learn to extract a surface opposition axis and parameterize a stable grasp on it given a representation of the object’s visible surface.
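The stability criterion can be made concrete with a short sketch that sums the torques exerted by the contact forces about the object center and maps the result to a reward. The reward values of +1 and -1, the numerical tolerance standing in for "equal to zero", and the function names are assumptions for illustration; how the model actually estimates contact forces is not described here.

```python
import numpy as np

def net_torque(contact_points, contact_forces, object_center):
    """Sum of the torques r x F about the object centre over all contacts."""
    arms = np.asarray(contact_points, dtype=float) - np.asarray(object_center, dtype=float)
    return np.cross(arms, np.asarray(contact_forces, dtype=float)).sum(axis=0)

def grasp_reward(contact_points, contact_forces, object_center, tol=1e-6):
    """Assumed reward: +1 if the net torque is (numerically) zero, else -1."""
    torque = net_torque(contact_points, contact_forces, object_center)
    return 1.0 if np.linalg.norm(torque) <= tol else -1.0

center = [0.0, 0.0, 0.0]

# Two fingers pressing toward each other along a line through the centre.
stable = grasp_reward([[1, 0, 0], [-1, 0, 0]], [[-1, 0, 0], [1, 0, 0]], center)

# The same forces applied off-axis form a couple that would spin the object.
unstable = grasp_reward([[1, 0.5, 0], [-1, -0.5, 0]], [[-1, 0, 0], [1, 0, 0]], center)
```

The second case shows why the choice of opposition axis matters: the forces are identical, but because the contacts no longer oppose each other through the object the grasp produces a net torque and would be negatively reinforced.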

Stochastic Population Codes


The weight matrix Wcc

Each population in the parietal and premotor cortices is implemented as a stochastic population code. Each population consists of two layers of leaky integrator neurons: the first encodes a (noisy) probability distribution over the values of the represented parameter, and the second encodes, as a population code, a value selected according to this distribution. The first layer, P, receives input X, and its membrane potential at time t follows \tau_{P}\frac{du^{P}(t)}{dt}=-u^{P}(t)+X+randn(\mu,\sigma^2), where randn(\mu,\sigma^2) returns a normally distributed value with mean \mu and variance \sigma^2. The firing rate of neuron i in this layer is given by P_{i}(t)=\Theta(u_{i}^{P}(t)), where \Theta(x) is a saturation function that bounds x between 0.0 and 1.0. Each neuron P_{i} projects to a corresponding neuron in the second layer, C_{i}, whose membrane potential follows \tau_{C}\frac{du^{C}(t)}{dt}=-u^{C}(t)+P(t)+W^{CC}C(t-dt)^T and whose firing rate is C_{i}(t)=\Theta(u_{i}^{C}(t)). The recurrent weights W^{CC} are set to W_{ij}^{CC}=e^{-\frac{(i-j)^2}{2\sigma^2}}-0.5, which gives them an on-center, off-surround connection profile: nearby neurons excite each other, while distant neurons inhibit each other.

Given a random input X composed of a sum of randomly scaled normal distributions with random means and \sigma=2, the network was allowed to run until t=100 with dt=0.01. The network yielded two interesting properties:

  • The noisy probability distribution in layer P yielded a smooth population code in layer C with a small variance.
  • The mean of the population code in layer C is centered on neuron i at time t with approximately the probability given by the Boltzmann equation p(\mu(t)=i)=\frac{1}{1+e^{-\frac{X_{i}(t)}{T}}}, where T is the temperature of the system and is a linear function of the standard deviation of the noise in the layer P.
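The prototype simulation described above can be approximated with a forward-Euler integration of the two layer equations. This is a minimal sketch under stated assumptions: the time constants, noise level, number of neurons, number of input bumps, and the reuse of a width of 2 for the recurrent kernel are all illustrative choices that the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def saturate(x):
    """Theta: bound values between 0.0 and 1.0."""
    return np.clip(x, 0.0, 1.0)

def lateral_weights(n, sigma_w=2.0):
    """W_ij = exp(-(i-j)^2 / (2 sigma^2)) - 0.5 (on-center, off-surround)."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma_w ** 2)) - 0.5

def random_input(n, n_bumps=3, sigma=2.0, seed=0):
    """Sum of randomly scaled Gaussian bumps with random means (sigma = 2)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    x = np.zeros(n)
    for _ in range(n_bumps):  # the number of bumps is an assumption
        mean = rng.uniform(0, n - 1)
        scale = rng.uniform(0.0, 1.0)
        x += scale * np.exp(-((idx - mean) ** 2) / (2.0 * sigma ** 2))
    return x

def simulate(x, t_end=100.0, dt=0.01, tau_p=1.0, tau_c=1.0,
             noise_mean=0.0, noise_std=0.1, sigma_w=2.0, seed=0):
    """Euler-integrate layers P and C; returns their final firing rates.

    tau_p, tau_c, noise_mean and noise_std are assumed values. Note that
    rng.normal takes a standard deviation, not a variance.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    w_cc = lateral_weights(n, sigma_w)
    u_p, u_c = np.zeros(n), np.zeros(n)
    p, c = np.zeros(n), np.zeros(n)
    for _ in range(int(round(t_end / dt))):
        noise = rng.normal(noise_mean, noise_std, size=n)
        u_p += dt / tau_p * (-u_p + x + noise)
        p = saturate(u_p)
        # c still holds the rates from the previous step, i.e. C(t - dt).
        u_c += dt / tau_c * (-u_c + p + w_cc @ c)
        c = saturate(u_c)
    return p, c

x = random_input(50)
p_final, c_final = simulate(x, t_end=5.0)
```

Running `simulate` repeatedly with different seeds and a longer `t_end`, then recording the location of the peak of C, is one way to examine how the selected value varies from trial to trial with the noise level; whether this sketch reproduces the two properties reported above depends on the assumed parameter values.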



Three trials of simulating a prototype network with a random input distribution. For each trial, the top row shows the input, the middle row shows P at the end of the simulation, and the bottom row shows C at the end of the simulation.

Joy of Grasping

Grasps are generated during the training phase by setting the firing rates in the C layer of each population at random, which produces random activity in the affordance extraction module and random motor parameter values in the motor planning module. After the grasp is generated, its stability is estimated, the reinforcement signal is calculated, and the modifiable weights in the model are adapted. ILGM and ILGA assume that for infants, successful grasping is inherently rewarding, a phenomenon referred to by Oztop, Bradley, & Arbib (2004) as the “Joy of Grasping”. More generally, it may be that the successful completion of any intentional act is rewarding to some degree whether or not an external reward is obtained. Thus, a more general version of the model for arbitrary manual action might use the same success signal that the ACQ model uses to adapt executability weights. This signal is based on a comparison of the output of the mirror system with an efference copy of the motor output to determine whether or not an attempted action proceeded as planned.

Three successful grasps (precision grasp, side grasp, and power grasp) generated during the babbling phase with a prototype system more closely based on ILGM than the one described here.