The HMAX Model

Research in the lab is based on a computational model of object recognition in cortex (Riesenhuber & Poggio, Nature Neuroscience, 1999), dubbed HMAX ("Hierarchical Model and X") by Mike Tarr (Nature Neuroscience, 1999) in his News & Views on the paper. Since we didn't think of a better name beforehand, HMAX stuck. Oh well...

The model summarizes the basic facts about the ventral visual stream, a hierarchy of brain areas thought to mediate object recognition in cortex. It was originally developed to account for the experimental data of Logothetis et al. (Cerebral Cortex, 1995) on the invariance properties and shape tuning of neurons in macaque inferotemporal cortex, the highest visual area in the ventral stream. Since then, the model has been shown to predict several other experimental results and to provide interesting perspectives on still other data and claims. The model serves as a basis for a variety of projects in our lab and in other labs.

The goal is to explain cognitive phenomena in terms of simple and well-understood computational processes in a physiologically plausible model. Thus, the model is a tool to integrate and interpret existing data and to make predictions that guide new experiments. Clearly, the road ahead will require close interaction between model and experiment. Towards this end, this web site provides background information on HMAX, including the source code and further references.

Please contact us with questions or comments.

Our collaborators in Tommy Poggio's group at MIT have done some really nice work with the model, in particular its application to machine vision problems. Please click here for an overview of their software and papers on the topic (look for "Model of Object Recognition").

The "Standard Model"

Object recognition in cortex is thought to be mediated by the ventral visual pathway (Ungerleider, 1994) running from primary visual cortex, V1, through extrastriate visual areas V2 and V4 to inferotemporal cortex, IT. Based on physiological experiments in monkeys, IT has been postulated to play a central role in object recognition. IT in turn is a major source of input to PFC, "the center of cognitive control" (Miller, 2000) involved in linking perception to memory.

Over the last decades, several physiological studies in non-human primates have established a core of basic facts about cortical mechanisms of recognition that seem to be widely accepted and that confirm and refine older data from neuropsychology. A brief summary of this consensus knowledge begins with the groundbreaking work of Hubel and Wiesel, first in the cat (Hubel, 1962, 1965) and then in the macaque (Hubel, 1968). Starting from simple cells in primary visual cortex, V1, with small receptive fields that respond preferentially to oriented bars, neurons along the ventral stream (Perrett, 1993; Tanaka, 1996; Logothetis, 1996) show an increase in receptive field size as well as in the complexity of their preferred stimuli (Kobatake, 1994). At the top of the ventral stream, in anterior inferotemporal cortex (AIT), cells are tuned to complex stimuli such as faces (Gross, 1972; Desimone, 1984, 1991; Perrett, 1992). A hallmark of these IT cells is the robustness of their firing to stimulus transformations such as scale and position changes (Tanaka, 1996; Logothetis, 1995, 1996; Perrett, 1993). In addition, as other studies have shown (Perrett, 1993; Booth, 1998; Logothetis, 1995; Hietanen, 1992), most neurons show specificity for a certain object view or lighting condition.

A comment about the architecture is important: in its basic, initial operation - akin to "immediate recognition" - the hierarchy is likely to be mainly feedforward (though local feedback loops almost certainly have key roles) (Perrett, 1993). ERP data (Thorpe, 1996) have shown that object recognition appears to take remarkably little time, on the order of the latency of the ventral visual stream (Perrett, 1992). This adds to earlier psychophysical studies using a rapid serial visual presentation (RSVP) paradigm (Potter, 1975; Intraub, 1981), which found that subjects could still process images presented as rapidly as 8 per second.

In summary, the accumulated evidence points to six widely accepted properties of the ventral stream architecture:

  • a hierarchical build-up of invariances, first to position and scale and then to viewpoint and more complex transformations requiring interpolation between several different object views;
  • in parallel, an increasing size of the receptive fields;
  • an increasing complexity of the optimal stimuli for the neurons;
  • a basic feedforward processing of information (for "immediate" recognition tasks);
  • plasticity and learning, probably at all stages and certainly at the level of IT;
  • learning specific to an individual object is not required for scale and position invariance (over a restricted range).

These basic facts lead to a Standard Model, likely to represent the simplest class of models reflecting the known anatomical and biological constraints. It represents in its basic architecture the average belief - often implicit - of many visual physiologists. In this sense it is definitely not "our" model. The broad form of the model is suggested by the basic facts; we have made it quantitative, and thereby predictive (through computer simulations).

Figure 1: Schematic of the Standard Model

The model reflects the general organization of visual cortex in a series of layers from V1 to IT to PFC. From the point of view of invariance properties, it consists of a sequence of two main modules based on two key ideas. The first module, shown schematically above, leads to model units showing the same scale and position invariance properties as the view-tuned IT neurons of (Logothetis, 1995), using the same stimuli. This is not an independent prediction, since the model parameters were chosen to fit Logothetis' data. It is, however, not obvious that a hierarchical architecture using plausible neural mechanisms could account for the measured invariance and selectivity. Computationally, this is accomplished by a scheme that can best be explained by taking striate complex cells as an example: invariance to changes in the position of an optimal stimulus (within a range) is obtained in the model by means of a maximum operation (max) performed on the simple cell inputs to the complex cells, where the strongest input determines the cell's output. Simple cell afferents to a complex cell are assumed to have the same preferred orientation, with their receptive fields located at different positions. Taking the maximum over the simple cell afferent inputs provides position invariance while preserving feature specificity. The key idea is that the step of filtering followed by a max operation is equivalent to a powerful signal processing technique: select the peak of the correlation between the signal and a given matched filter, where the correlation is either over position or scale. The model alternates layers of units that combine simple filters into more complex ones - to increase pattern selectivity - with layers based on the max operation - to build invariance to position and scale while preserving pattern selectivity.
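For readers who prefer code, the following few lines of Python illustrate the pooling idea (the function name and the toy numbers are our own illustration, not part of any released implementation):

    import numpy as np

    # A complex cell pools simple-cell afferents that share a preferred
    # orientation but have receptive fields at different positions; its
    # response is the MAX over those afferents, so the preferred feature
    # is signalled regardless of where, within the pooling range, it appears.
    def complex_cell_response(simple_cell_inputs):
        return float(np.max(simple_cell_inputs))

    # The same oriented bar at different positions drives different simple
    # cells, but the complex cell's output stays the same:
    print(complex_cell_response([0.1, 0.9, 0.2]))  # 0.9
    print(complex_cell_response([0.9, 0.2, 0.1]))  # 0.9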

In the second part of the architecture, shown above, learning from multiple examples, i.e., different view-tuned neurons, leads to view-invariant units as well as to neural circuits performing specific tasks. The key idea here is that interpolation and generalization can be obtained by simple networks, similar to Gaussian Radial Basis Function networks (Poggio, 1990), that learn from a set of examples, that is, input-output pairs. In this case, the inputs are views and the outputs are the parameters of interest, such as the label of the object or its pose or expression (for a face). The Gaussian Radial Basis Function (GRBF) network has a hidden unit for each example view, broadly tuned to the features of an example image (see also deBeeck (2001)). The weights from the hidden units to the output are learned from the set of examples, that is, input-output pairs. In principle, two networks sharing the same hidden units but with different weights (from the hidden units to the output unit) could be trained to perform different tasks, such as pose estimation or view-invariant recognition. Depending just on the set of training examples, learning networks of this type can learn to categorize across exemplars of a class (Riesenhuber AI Memo, 2000) as well as to identify an object across different illuminations and different viewpoints. The demonstration (Poggio, 1990) that a view-based GRBF model could achieve view-invariant object recognition in fact motivated psychophysical experiments (Buelthoff, 1992; Gauthier, 1997). In turn, the psychophysics provided strong support for the view-based hypothesis against alternative theories (for a review see Tarr (1998)) and, together with the model, triggered the physiological work of Logothetis (1995).
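As a rough illustration of the GRBF idea (our own sketch, with made-up names and parameter values, not the published implementation), the same hidden units tuned to example views can feed several read-out units, each trained for a different task:

    import numpy as np

    def grbf_outputs(x, example_views, weight_sets, sigma=1.0):
        # Hidden units: one Gaussian per stored example view, broadly tuned
        # to that view's feature vector.
        hidden = np.array([np.exp(-np.sum((np.asarray(x) - np.asarray(v)) ** 2)
                                  / (2 * sigma ** 2))
                           for v in example_views])
        # Each weight set reads out the same hidden units for a different
        # task (e.g., identification vs. pose estimation); the weights are
        # learned from input-output pairs.
        return [float(np.dot(w, hidden)) for w in weight_sets]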

Thus, the two key ideas in the model are:

  • the max operation provides invariance at several steps of the hierarchy
  • the RBF-like learning network learns a specific task based on a set of cells tuned to example views.

Inside HMAX

Figure 2: The basic HMAX model consists of a hierarchy of five levels, from the S1 layer with simple-cell-like response properties to the VTU level with shape tuning and invariance properties like those of the view-tuned cells found in monkey inferotemporal cortex (see Logothetis et al., 1995).

For more information, please see the original publications. The basic model is described in the 1999 Nature Neuroscience paper:

Riesenhuber, M. & Poggio, T. (1999). Hierarchical Models of Object Recognition in Cortex. Nature Neuroscience 2: 1019-1025.

More details on how tuning properties in HMAX, in particular invariance ranges, depend on pooling parameters can be found in:

Schneider, R., & Riesenhuber, M. (2004). On the Difficulty of Feature-based Attentional Modulations in Visual Object Recognition: A Modeling Study. CBCL Paper #235/AI Memo #2004-004, Massachusetts Institute of Technology, Cambridge, MA, February 2004.

S1 Layer

In the HMAX model of object recognition in the ventral visual stream of primates, input images (we used 128 x 128 or 160 x 160 greyscale pixel images) are densely sampled by arrays of two-dimensional Gaussian filters, the so-called S1 units (second derivative of Gaussian, orientations 0°, 45°, 90°, and 135°, sizes from 7 x 7 to 29 x 29 pixels in two-pixel steps), which are sensitive to bars of different orientations and thus roughly resemble the properties of simple cells in striate cortex. Filters of each size and orientation are centered at each pixel of the input image. The filters are sum-normalized to zero and square-normalized to 1, and the result of the convolution of an image patch with a filter is divided by the power (sum of squares) of the image patch. This yields an S1 activity between −1 and 1.
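A minimal sketch of this normalization step, in Python (our own illustration of the description above, not the released code):

    import numpy as np

    def normalize_filter(f):
        # Sum-normalize the filter to zero, then square-normalize it to 1.
        f = f - f.mean()
        return f / np.sqrt(np.sum(f ** 2))

    def s1_response(patch, filt):
        # Dot product of an image patch with a normalized filter, divided
        # by the power (sum of squares) of the patch, as described above.
        power = np.sum(patch ** 2)
        return 0.0 if power == 0 else float(np.sum(patch * filt) / power)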

C1 Layer

In the next step, filter bands are defined, i.e., groups of S1 filters of a certain size range (7 x 7 to 9 x 9 pixels; 11 x 11 to 15 x 15 pixels; 17 x 17 to 21 x 21 pixels; and 23 x 23 to 29 x 29 pixels). Within each filter band, a pooling range is defined (variable poolRange) which determines the size of the array of neighboring S1 units, of all sizes in that filter band, that feed into a C1 unit (roughly corresponding to complex cells of striate cortex). Only S1 filters with the same preferred orientation feed into a given C1 unit, to preserve feature specificity. We used pooling range values of 4 for the smallest filter band (meaning that 4 x 4 neighboring S1 filters of size 7 x 7 pixels and 4 x 4 filters of size 9 x 9 pixels feed into a single C1 unit of the smallest filter band), 6 and 9 for the two intermediate filter bands, respectively, and 12 for the largest filter band. The pooling operation that the C1 units use is the "MAX" operation, i.e., a C1 unit's activity is determined by the strongest input it receives. That is, a C1 unit responds best to a bar of the same orientation as the S1 units that feed into it, but already with an amount of spatial and size invariance that corresponds to the spatial and filter size pooling ranges used for a C1 unit in the respective filter band. Additionally, C1 units are invariant to contrast reversal, much like complex cells in striate cortex, by taking the absolute value of their S1 inputs (before performing the MAX operation), modeling input from two sets of simple cell populations with opposite phase. Possible firing rates of a C1 unit thus range from 0 to 1. Furthermore, the receptive fields of the C1 units overlap by a certain amount, given by the value of the parameter c1Overlap. We mostly used a value of 2, meaning that half the S1 units feeding into a C1 unit were also used as input for the adjacent C1 unit in each direction. Higher values of c1Overlap indicate a greater degree of overlap.
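The following sketch shows the C1 pooling step for one orientation within one filter band (the array layout and names are our own choices; the parameter meanings follow the text above):

    import numpy as np

    def c1_response(s1_band, row, col, pool_range):
        # s1_band: S1 responses for one orientation within one filter band,
        # shape (n_filter_sizes_in_band, height, width).  C1 takes the
        # absolute value of its S1 inputs (contrast-reversal invariance)
        # and then the MAX over a pool_range x pool_range spatial
        # neighborhood and over all filter sizes in the band.
        patch = np.abs(s1_band[:, row:row + pool_range, col:col + pool_range])
        return float(patch.max())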

S2 Layer

Within each filter band, a square of four adjacent, nonoverlapping C1 units is then grouped to provide input to an S2 unit. There are 256 different types of S2 units in each filter band, corresponding to the 4^4 possible arrangements of four C1 units, each of which can have one of four types (i.e., preferred bar orientations). The S2 unit response function is a Gaussian with mean 1 (i.e., {1; 1; 1; 1}) and standard deviation 1, i.e., an S2 unit has a maximal firing rate of 1, which is attained if each of its four afferents fires at a rate of 1 as well. S2 units provide the feature dictionary of HMAX, in this case all combinations of 2 x 2 arrangements of "bars" (more precisely, C1 cells) at four possible orientations.
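In code, the S2 response function described above might look as follows (our own sketch):

    import numpy as np

    def s2_response(c1_afferents, sigma=1.0):
        # c1_afferents: the four C1 inputs (the 2 x 2 arrangement, flattened).
        # The response is a Gaussian centered on (1, 1, 1, 1) with standard
        # deviation 1, so it peaks at 1 when every afferent fires at 1.
        d2 = np.sum((np.asarray(c1_afferents, dtype=float) - 1.0) ** 2)
        return float(np.exp(-d2 / (2 * sigma ** 2)))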

C2 Layer

To finally achieve size invariance over all filter sizes in the four filter bands and position invariance over the whole visual field, the S2 units are again pooled by a MAX operation to yield C2 units, the output units of the HMAX core system, designed to correspond to neurons in extrastriate visual area V4 or posterior IT (PIT). There are 256 C2 units, each of which pools over all S2 units of one type at all positions and scales. Consequently, a C2 unit will fire at the same rate as the most active S2 unit that is selective for the same combination of four bars, but regardless of its scale or position.
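A sketch of the C2 pooling for a single S2 feature type (the array layout is our own choice):

    import numpy as np

    def c2_response(s2_maps_for_one_type):
        # s2_maps_for_one_type: responses of all S2 units of one type, one
        # array per filter band, each of shape (height, width).  The C2 unit
        # takes the global MAX over all positions and all filter bands.
        return float(max(np.max(m) for m in s2_maps_for_one_type))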

VTU Layer

C2 units in turn provide input to the view-tuned units (VTUs), named after their property of responding well to a certain two-dimensional view of a three-dimensional object, thereby closely resembling the view-tuned cells found in monkey inferotemporal cortex by Logothetis et al. The C2-to-VTU connections are so far the only stage of the HMAX model where learning occurs. A VTU is tuned to a stimulus by taking the activities of the 256 C2 units in response to that stimulus as the center of a 256-dimensional Gaussian response function, yielding a maximal response of 1 for a VTU in case the C2 activation pattern exactly matches the C2 activation pattern evoked by the training stimulus. To achieve greater robustness in the case of cluttered stimulus displays, only those C2 units that respond most strongly to the training stimulus may be selected as afferents for a VTU. An additional parameter specifying the response properties of a VTU is its sigma value, i.e., the standard deviation of its Gaussian response function. A smaller sigma value yields more specific tuning, since the resulting Gaussian has a narrower half-maximum width.
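A minimal sketch of a VTU's response (the default sigma value here is just a placeholder, not a value from the publications):

    import numpy as np

    def vtu_response(c2_pattern, c2_center, sigma=0.5):
        # c2_center: the 256-dimensional C2 activation pattern evoked by the
        # training view (the center of the Gaussian); sigma is the tuning
        # width, with smaller values giving more specific tuning.
        d2 = np.sum((np.asarray(c2_pattern) - np.asarray(c2_center)) ** 2)
        return float(np.exp(-d2 / (2 * sigma ** 2)))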

Source Code

Link coming soon...