Primary Visual Cortex, Neural Coding and Visual Processing

Primary Visual Cortex

Q1: Why is the primary visual cortex called the striate cortex?

Question: 1) Why is the primary visual cortex often called the “striate” cortex?

Answer: It is called the striate cortex because, under the microscope, layer IV contains a visible stripe (the line of Gennari) formed by dense bands of myelinated axons arriving from the lateral geniculate nucleus (LGN). This distinct striation gives V1 its name; it marks the major thalamic input layer and is absent in neighboring cortical areas, which makes V1 easy to distinguish anatomically.

Q2: Inputs and projections of primary visual cortex

Question: 2) From where does primary visual cortex receive its input, and to where does it project?

Answer: V1 receives its main input from the LGN of the thalamus, which in turn receives signals from retinal ganglion cells (RGCs). The pathway follows:

  • Image → Photoreceptors → RGCs (center–surround receptive fields) → LGN → V1

In V1, information diverges into simple and complex cells for further feature analysis. V1 then projects feedforward outputs to higher visual areas (V2, V3, MT, etc.) for more complex shape, motion, and depth processing, and sends feedback to LGN to modulate attention and gain control.

Q3: Typical simple-cell receptive fields

Question: 3) What do typical simple-cell receptive fields look like?

Answer: Simple cells in V1 have elongated, oriented receptive fields made of distinct ON (excitatory) and OFF (inhibitory) subregions.

  • The ON (excitatory) zones respond to light onset (bright bars or edges falling within them).
  • The OFF (inhibitory) zones respond to light offset (when light is removed from them).

They respond best to bars or edges of a specific orientation and position, forming the building blocks for edge and contour detection. The tuning curve of a simple cell is unimodal—its firing rate peaks at the cell’s preferred orientation.
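
As an illustration of what such a receptive field looks like mathematically, simple-cell receptive fields are commonly modeled as Gabor functions (an oriented sinusoid under a Gaussian envelope). A minimal Python sketch with made-up parameter values:

    import numpy as np

    def gabor_rf(x, y, sigma=1.0, wavelength=2.0, theta=0.0, phase=0.0):
        """2-D Gabor: oriented ON/OFF stripes under a Gaussian envelope."""
        # Rotate coordinates so the sinusoid varies along the preferred orientation theta.
        xr = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
        carrier = np.cos(2 * np.pi * xr / wavelength + phase)
        return envelope * carrier

    # Evaluate the receptive field on a small grid of visual-space positions.
    xs, ys = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))
    rf = gabor_rf(xs, ys, theta=np.pi / 4)   # preferred orientation 45 deg

    # A linear simple-cell response is the dot product of stimulus and RF:
    # a pattern matching the ON/OFF layout drives the cell strongly,
    # while an orthogonal pattern does not.
    print("matched response:   ", np.sum(rf * gabor_rf(xs, ys, theta=np.pi / 4)))
    print("orthogonal response:", np.sum(rf * gabor_rf(xs, ys, theta=-np.pi / 4)))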

Q4: Difference between simple and complex cells

Question: 4) What is the difference between a simple and complex cell?

Answer:

  • Receptive-field structure: simple cells have distinct ON/OFF subregions; complex cells have no clear ON/OFF areas.
  • Position sensitivity: simple cells are sensitive to the precise bar location; complex cells are position-invariant (they fire anywhere in the field if the orientation is right).
  • Response to drifting gratings: simple cells give sharp, periodic firing locked to the grating (modulated; F1/F0 > 1); complex cells give more uniform, summed firing (unmodulated; F1/F0 < 1).
  • Cortical layer: simple cells sit mostly in layer 4 (the input layer); complex cells in layers 2, 3, 5, and 6 (feedforward + feedback).

Complex cells integrate information from many simple cells with similar orientation preferences, creating position and polarity invariance.
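
The F1/F0 criterion in the comparison above can be computed from a cycle-averaged response histogram: F0 is the mean firing rate and F1 is the amplitude of the Fourier component at the grating's drift frequency. A hedged sketch with synthetic (not recorded) responses:

    import numpy as np

    def f1_f0_ratio(rate, n_cycles):
        """Modulation ratio from a firing-rate histogram spanning n_cycles of the grating."""
        spectrum = np.fft.rfft(rate) / len(rate)
        f0 = spectrum[0].real                  # mean rate
        f1 = 2 * np.abs(spectrum[n_cycles])    # amplitude at the stimulus frequency
        return f1 / f0

    t = np.linspace(0, 1, 1000, endpoint=False)                        # 1 s, 4 grating cycles
    simple_like = np.clip(20 * np.sin(2 * np.pi * 4 * t), 0, None)     # half-wave rectified, strongly modulated
    complex_like = 10 + 1 * np.sin(2 * np.pi * 4 * t)                  # nearly unmodulated elevation

    print("simple-like  F1/F0:", round(f1_f0_ratio(simple_like, 4), 2))   # > 1
    print("complex-like F1/F0:", round(f1_f0_ratio(complex_like, 4), 2))  # < 1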

Q5: Why are complex cells higher in the hierarchy?

Question: 5) How to interpret “complex cells are higher in the processing hierarchy than simple cells”?

Answer: Complex cells receive input from multiple simple cells, summing their outputs and losing sensitivity to exact position or contrast polarity. This makes them more abstract and invariant—a higher-level stage of visual representation. The hierarchy continues: simple → complex → composite → shape-tuned cells.

Q6: Olshausen and Field 1996 result

Question: 6) What did Bruno Olshausen and David Field demonstrate in their 1996 Nature paper?

Answer: They trained a computational network on natural images to reconstruct visual input efficiently. When optimized for sparse, energy-efficient coding (minimizing active filters), the model’s receptive fields resembled V1 simple cells—localized, oriented Gabor-like filters.

→ This showed that V1’s receptive-field organization can emerge from the principle of efficient coding of natural scenes.
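
A hedged sketch of the kind of objective they optimized, namely reconstruction error plus a sparseness penalty; the variable names and the L1 penalty used here are illustrative simplifications, not their exact formulation:

    import numpy as np

    def sparse_coding_cost(image_patch, basis, coeffs, lam=0.1):
        """Cost = reconstruction error + lambda * sparseness penalty on the coefficients.

        image_patch : flattened patch, shape (P,)
        basis       : dictionary of receptive-field-like features, shape (P, K)
        coeffs      : activation of each feature, shape (K,)
        """
        reconstruction = basis @ coeffs
        reconstruction_error = np.sum((image_patch - reconstruction) ** 2)
        sparseness_penalty = np.sum(np.abs(coeffs))   # few active units preferred
        return reconstruction_error + lam * sparseness_penalty

    # Toy example: random patch, random overcomplete dictionary, random coefficients.
    rng = np.random.default_rng(0)
    patch = rng.standard_normal(64)          # e.g., an 8x8 image patch, flattened
    phi = rng.standard_normal((64, 128))     # 128 candidate features
    a = rng.standard_normal(128) * 0.01
    print("cost:", sparse_coding_cost(patch, phi, a))

Minimizing such a cost over many natural-image patches (adjusting both the coefficients and the dictionary) is what drives the dictionary toward localized, oriented, Gabor-like features.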

Q7: Surround suppression differences

Question: 7) How is surround suppression in V1 neurons different from that in retinal ganglion cells?

Answer:

  • Retinal ganglion cells: The center–surround organization is part of the classical receptive field and is largely hardwired; the surround mainly enhances local contrast, and a stimulus confined to the surround can itself drive the cell (e.g., light offset excites the OFF surround of an ON-center cell).
  • V1 neurons: Surround suppression is context-dependent and extraclassical: when a grating extends beyond the neuron's classical receptive field, the response decreases, most strongly when the surround matches the center's preferred orientation. The surround is purely suppressive/modulatory; a stimulus presented in the surround alone does not excite the neuron.

This cortical surround integrates more global visual context (orientation, contrast, direction), reflecting higher-order processing.

Q8: Measuring ocular dominance columns

Question: 8) How would you measure ocular dominance columns in a cat or monkey?

Answer: You could use:

  • Autoradiography: Inject a radioactive tracer (e.g., ³H-proline) into one eye; it is transported transneuronally through the LGN to V1, where alternating labeled and unlabeled stripes reveal the eye-specific input bands in layer 4. (Alternatively, systemically injected radiolabeled 2-deoxyglucose labels the columns driven by the stimulated eye while the other eye is occluded.)
  • Optical imaging: Visualize cortical activity (with voltage-sensitive dyes or intrinsic signals) while stimulating each eye in turn; comparing the two activity maps reveals distinct ocular-dominance stripes across the cortical surface, reflecting the columnar organization of the layer 4C inputs.

Both reveal the zebra-like ocular dominance pattern where each column prefers input from one eye.

Q9: Concept of a hypercolumn

Question: 9) Describe the concept of the hypercolumn in primary visual cortex.

Answer: A hypercolumn is the complete functional unit of V1 representing one small point in visual space. Within that patch (~1 mm²):

  • All orientation columns (0–180°) are present.
  • Both left- and right-eye ocular-dominance columns are represented.

Thus every location in the visual field has a corresponding hypercolumn that encodes all possible orientations and binocular inputs for that point—maintaining retinotopy, orientation maps, and ocular-dominance structure together.


Neural Coding

Q1: What is a tuning curve?

Question: 1) What is a tuning curve?

Answer: A tuning curve shows how strongly a neuron responds to different values of a specific stimulus property—for example, orientation, spatial frequency, or size. The x-axis represents the stimulus feature (e.g., angle of orientation); the y-axis shows the neuron’s average firing rate (spikes/sec) across many identical trials. The curve is usually bell-shaped (unimodal), meaning the neuron fires most for its preferred feature and less as the feature deviates. It describes what the neuron prefers and how sharply tuned it is to that property.
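
A minimal sketch of such a tuning curve, modeled as a Gaussian bump around the preferred orientation (all parameter values are made up):

    import numpy as np

    def tuning_curve(theta, preferred=90.0, width=20.0, r_max=50.0, baseline=5.0):
        """Mean firing rate (spikes/s) as a function of stimulus orientation (degrees).

        Orientation is circular with period 180 deg, so use the wrapped difference.
        """
        d = (theta - preferred + 90.0) % 180.0 - 90.0   # wrapped difference in [-90, 90)
        return baseline + r_max * np.exp(-d**2 / (2 * width**2))

    orientations = np.arange(0, 180, 15)
    rates = tuning_curve(orientations)
    for o, r in zip(orientations, rates):
        print(f"{o:3d} deg -> {r:5.1f} spikes/s")   # peaks at the preferred 90 deg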

Q2: Units for neuronal activity

Question: 2) What are the units with which we usually report the measured activity of a neuron?

Answer: Neuronal activity is typically measured in spikes per second (Hz)—the firing rate. This represents how many action potentials the neuron produces during a given time window in response to a stimulus.

Q3: Problems with single-neuron bell-shaped tuning

Question: 3) Let’s assume a neuron has a “bell-shaped” tuning curve. What are the two general problems for a precise encoding of a stimulus feature (e.g., orientation) with this single neuron?

Answer: There are two main problems when relying on a single neuron’s tuning curve to represent a stimulus:

  • Ambiguity (non-unique mapping): The tuning curve is symmetric around the preferred value, so the neuron fires at the same rate for two different orientations (one on each side of the peak). → For a neuron preferring 90°, you can't tell whether a certain firing rate corresponds to, for example, 40° or 140°.
  • Noise / Variability: Neurons are noisy and variable—even for the same stimulus, they don’t fire exactly the same number of spikes each time. → Each trial is like one random draw from a Poisson distribution, so a single neuron’s response can’t be perfectly reliable.

Q4: Poisson-distributed neural variability

Question: 4) Neural response variability is Poisson-distributed. What does that mean?

Answer: It means that the number of spikes a neuron produces over repeated identical presentations of the same stimulus is random but follows the Poisson distribution: the mean equals the variance (a key property). Some trials will have more spikes, some fewer, even though the stimulus is identical. This variability is a fundamental feature of cortical neurons.
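
A quick simulation with synthetic spike counts showing the defining mean-equals-variance property:

    import numpy as np

    rng = np.random.default_rng(1)

    mean_rate = 12.0     # expected spike count for this stimulus in the counting window
    n_trials = 10000     # repeated identical presentations

    spike_counts = rng.poisson(lam=mean_rate, size=n_trials)

    print("mean count: ", spike_counts.mean())   # ~12
    print("variance:   ", spike_counts.var())    # ~12 as well
    print("Fano factor:", spike_counts.var() / spike_counts.mean())   # ~1 for Poisson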

Q5: Population coding of stimulus features

Question: 5) How can we think of encoding (i.e., representing) a stimulus feature with an entire population of neurons?

Answer: Instead of one neuron representing the stimulus, the brain uses a population code—many neurons with slightly different preferred orientations work together. Each neuron “votes” according to how strongly it fires. The combined pattern of activity across the population represents the stimulus. This allows the brain to estimate the true stimulus value by calculating a population vector—the weighted average (“center of mass”) of all neurons’ preferred orientations based on their firing rates.
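
A hedged sketch of a population-vector readout for orientation; because orientation is 180°-periodic, the angles are doubled before averaging (a standard circular-statistics trick assumed here, not something stated above):

    import numpy as np

    rng = np.random.default_rng(2)

    # Population of neurons with preferred orientations tiling 0-180 deg.
    preferred = np.linspace(0, 180, 36, endpoint=False)

    def mean_rates(stimulus_deg, width=20.0, r_max=30.0):
        d = (preferred - stimulus_deg + 90.0) % 180.0 - 90.0
        return r_max * np.exp(-d**2 / (2 * width**2))

    stimulus = 72.0
    rates = rng.poisson(mean_rates(stimulus))      # one noisy trial (Poisson counts)

    # Each neuron "votes" with a unit vector at twice its preferred angle,
    # weighted by its firing rate; halving the resulting angle decodes the stimulus.
    angles = np.deg2rad(2 * preferred)
    vector = np.sum(rates * np.exp(1j * angles))
    decoded = np.rad2deg(np.angle(vector)) / 2 % 180

    print(f"true orientation: {stimulus} deg, decoded: {decoded:.1f} deg")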

Q6: How population codes overcome single-neuron limits

Question: 6) How does a population code overcome the shortcomings of a single-neuron code?

Answer: A population code solves both ambiguity and noise problems:

  • Ambiguity: Each neuron has a slightly different tuning curve, so the pattern of activity across neurons uniquely identifies the stimulus.
  • Noise reduction: Because noise is independent (or partly independent) across neurons, the brain can average over many neurons, which cancels out random variability—producing a more stable and accurate representation.

This is sometimes called a “democratic vote” among neurons.

Q7: What are noise correlations?

Question: 7) What are noise correlations?

Answer: Noise correlations describe how variability in one neuron’s firing relates to another’s:

  • If two neurons’ fluctuations go up and down together, their noise is positively correlated.
  • If one increases while the other decreases, they are negatively (anti-) correlated.
  • If they vary independently, there is no correlation.

Noise correlations are often higher between neurons that are tuned to similar orientations and are physically close together, especially in V1.

Q8: When are noise correlations harmful?

Question: 8) When are noise correlations a bad thing for encoding?

Answer: Noise correlations are harmful when the shared fluctuations lie along the direction perpendicular to the decision boundary (i.e., along the axis that distinguishes the stimuli), because then the neurons tend to make the same mistake on the same trial. Such positively correlated errors cannot be removed by averaging; they add up rather than cancel out. In contrast, "good" correlations (e.g., anti-correlations) lie parallel to the decision boundary and do not interfere with downstream decoding, so the brain can still filter out that noise efficiently (a small simulation after the summary below illustrates the difference).

In short:

  • Bad correlations = shared noise that distorts the population’s representation.
  • Good correlations = shared noise that doesn’t interfere with accurate decoding.
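
A small simulation with illustrative parameters of why shared ("bad") noise limits averaging: with independent noise the variance of the population average shrinks as 1/N, but a common noise component survives no matter how many neurons are pooled:

    import numpy as np

    rng = np.random.default_rng(3)

    n_neurons, n_trials = 100, 20000
    signal = 10.0          # true mean response
    private_sd = 3.0       # independent noise per neuron
    shared_sd = 1.0        # noise component common to all neurons ("bad" correlation)

    private = rng.normal(0, private_sd, size=(n_trials, n_neurons))
    shared = rng.normal(0, shared_sd, size=(n_trials, 1))   # same draw for every neuron

    independent_pop = signal + private
    correlated_pop = signal + private + shared

    print("variance of the population average:")
    print("  independent noise:", independent_pop.mean(axis=1).var())   # ~ 3^2/100 = 0.09
    print("  with shared noise:", correlated_pop.mean(axis=1).var())    # ~ 0.09 + 1.0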

Motion

Q1: Space-time receptive field

Question: 1) What is a space-time receptive field?

Answer: A space-time receptive field describes how a neuron’s firing depends on both spatial position and time—essentially how a visual pattern changes across space and time. In motion processing, it captures how a neuron responds to moving stimuli, not just static patterns. In V1, motion-sensitive neurons have spatiotemporal receptive fields tilted in space–time coordinates, meaning they respond when a stimulus moves in a specific direction and speed. A stationary stimulus produces a vertical stripe in the space-time map, while a moving stimulus forms a diagonal stripe—neurons detect this diagonal “tilt” as motion.
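
A tiny text-based illustration (made-up numbers) of that space-time picture: a stationary bar traces a vertical stripe in the x-t map, while a moving bar traces a diagonal whose slope is its speed:

    import numpy as np

    x = np.arange(20)     # spatial positions
    t = np.arange(5)      # time steps

    def bar_position(t_step, speed):
        return 8 + speed * t_step     # bar centre at time t_step

    for speed, label in [(0, "stationary"), (2, "moving right at 2 units/step")]:
        print(label)
        for ts in t:
            row = np.where(np.abs(x - bar_position(ts, speed)) <= 1, "#", ".")
            print("  t=%d  %s" % (ts, "".join(row)))

A direction-selective neuron whose space-time receptive field is tilted to match that diagonal responds strongly to the moving bar but weakly to the stationary one.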

Q2: Evidence for local motion detectors

Question: 2) Name two reasons why we can assume that perception of visual motion is based on local motion detectors/neurons.

Answer:

  • Direction-selective neurons exist in early visual cortex (V1 and MT). These neurons respond only when a stimulus moves in their preferred direction and speed, showing that motion is locally encoded by elementary motion detectors, such as Reichardt detectors.
  • Adaptation aftereffects demonstrate direction tuning. When motion-sensitive neurons are stimulated for a prolonged period (e.g., watching a waterfall), they adapt and their responsiveness drops. The resulting motion aftereffect (stationary objects appearing to drift in the opposite direction) shows that the brain compares opponent motion detectors, one for each direction, which is a clear sign of localized motion processing.

Q3: Location of area MT

Question: 3) Where is area MT located in the brain?

Answer: Area MT (middle temporal area), also called V5, lies in the posterior portion of the superior temporal sulcus within the dorsal stream (the “where” or motion pathway) of the primate brain. It is part of the parietal/dorsal visual pathway responsible for motion, depth, and spatial relationships.

Q4: Main inputs to MT

Question: 4) Which neurons are providing the main input to neurons in area MT?

Answer: MT neurons receive input primarily from direction-selective neurons in V1. Many V1 neurons with similar direction preferences converge onto a single MT neuron, allowing MT to integrate motion information over larger regions of space. This pooling gives MT neurons larger receptive fields and greater direction and speed selectivity. Flow of information: V1 → MT → MST (with MST performing even more invariant, complex motion analysis).

Q5: MT tuning characteristics

Question: 5) What are the tuning characteristics of neurons in area MT?

Answer: MT neurons show strong direction and speed selectivity:

  • Direction tuning: MT cells respond maximally to a specific direction of motion, but less strongly to others—the tuning curve for direction is relatively broad around the preferred direction.
  • Speed tuning: Each neuron has a preferred speed at which its response peaks, then decreases at slower or faster speeds.
  • Other properties: receptive fields are larger than those in V1; they are retinotopically organized. Most MT neurons are strongly direction-selective and contribute to perception of coherent motion.

Q6: Evidence MT governs motion perception

Question: 6) Which four measurements/manipulations together provided strong evidence that MT neurons are governing the perception of visual motion?

Answer: Four key types of causal evidence link MT activity to motion perception:

  • Lesion evidence: When MT is temporarily inactivated or damaged, animals show motion perception deficits—they cannot reliably judge motion direction, even though visual acuity and contrast sensitivity remain normal.
  • Neural–behavioral correlation: The psychometric function (behavioral performance) and neurometric function (single-neuron sensitivity) match closely—indicating that MT neurons’ responses predict motion discrimination ability.
  • Microstimulation: Artificially activating a group of MT neurons biases motion perception toward those neurons’ preferred direction, proving those neurons are causally linked to perception.
  • Trial-by-trial correlation (choice probability): When the same motion stimulus is presented repeatedly, small fluctuations in an MT neuron’s firing correlate with variations in the animal’s perceptual choice—linking neural noise to perceptual noise.

Together, these four findings demonstrate that MT activity governs motion perception.

Q7: Random-dot motion stimulus and coherence

Question: 7) Describe a random-dot motion stimulus; what is the stimulus parameter that allows us to change the difficulty with which the direction of motion can be perceived?

Answer: A random-dot motion (RDM) stimulus consists of a field of moving dots, where a certain percentage of dots move coherently in one direction, while the rest move randomly. The key parameter is motion coherence (the percentage of correlated dots). High coherence (e.g., 12.8%) → easy task; strong global motion signal. Low coherence (e.g., 0.8%) → hard task; motion appears noisy and ambiguous. The RDM allows researchers to precisely control task difficulty and test both behavioral and neuronal sensitivity.
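
A hedged sketch of one way to update the dots of an RDM stimulus at a given coherence; real implementations differ in details such as dot lifetime and frame interleaving:

    import numpy as np

    rng = np.random.default_rng(4)

    def update_dots(xy, coherence, direction_deg=0.0, step=0.02):
        """Move a fraction `coherence` of dots in the signal direction; re-plot the rest randomly.

        xy : dot positions in [0, 1) x [0, 1), shape (N, 2)
        """
        n = len(xy)
        signal = rng.random(n) < coherence                 # which dots carry the motion signal
        d = np.deg2rad(direction_deg)
        xy[signal] += step * np.array([np.cos(d), np.sin(d)])
        xy[~signal] = rng.random((np.sum(~signal), 2))     # noise dots: random re-plot
        return xy % 1.0                                    # wrap around the aperture

    dots = rng.random((200, 2))
    # ~12.8% of dots carry a rightward signal on this frame; the rest are noise.
    dots = update_dots(dots, coherence=0.128, direction_deg=0.0)
    print("dot positions after one frame:", dots.shape)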

Q8: Computing neurometric curves

Question: 8) How can we compute the probability of correctly discriminating between a stimulus moving in the preferred vs. null direction of an MT neuron based on the neuron’s response (i.e., the neurometric curve)?

Answer: Record the neuron’s responses (spike counts) to many trials of preferred and null motion directions. Plot the two response distributions (spike count histograms). The degree of overlap between the distributions determines how distinguishable the two directions are. Compute the area under the ROC curve (receiver operating characteristic)—this gives the neurometric curve, representing the probability that an ideal observer could discriminate between preferred and null motion based solely on that neuron’s activity. A neuron with no overlap (perfect separation) = 100% correct discrimination. More overlap = lower discriminability.
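
A minimal sketch of that ROC computation using synthetic Poisson spike counts for the preferred and null directions (the rates are arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)

    pref_counts = rng.poisson(20, size=500)   # spike counts on preferred-direction trials
    null_counts = rng.poisson(12, size=500)   # spike counts on null-direction trials

    def roc_area(pref, null):
        """Probability that an ideal observer, comparing one draw from each distribution,
        correctly labels the larger count as 'preferred' (ties count as half)."""
        pref = pref[:, None]
        null = null[None, :]
        return np.mean(pref > null) + 0.5 * np.mean(pref == null)

    print("P(correct) for this neuron:", round(roc_area(pref_counts, null_counts), 3))
    # Repeating this across coherence levels traces out the neurometric curve.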

Q9: What is choice probability?

Question: 9) What is choice probability?

Answer: Choice probability quantifies how strongly a neuron’s activity on a single trial predicts the animal’s perceptual decision when the stimulus itself is ambiguous (e.g., low-coherence RDM). If a neuron fires more on trials when the subject reports motion in that neuron’s preferred direction, its choice probability is high (close to 1). A value around 0.5 means no relationship between neural firing and behavior. Thus, choice probability links trial-by-trial neural variability to variability in perceptual choices, supporting the idea that MT neuron activity is directly involved in motion perception.


Depth Perception and Stereo

Q1: Fundamental challenge for depth perception

Question: 1) What is the fundamental challenge for perceiving depth with visual information?

Answer: The main challenge is that the 3-dimensional world projects onto a 2-dimensional retina, so depth information is lost in the image that reaches each eye. The brain must reconstruct the missing third dimension using various cues—differences between the two eyes (binocular disparity), motion, and learned pictorial cues like shading, occlusion, and perspective. Essentially, vision must infer “how far away” something is from flat, two-dimensional input.

Q2: Are two eyes necessary?

Question: 2) Are two eyes necessary for perceiving 3D object information?

Answer: No—depth perception remains possible with one eye (monocular cues), but it is less precise. When one eye is closed, the brain can still use pictorial cues: shading, height in the visual field, occlusion, and perspective (SHOP). With both eyes open, the brain can additionally use binocular disparity—a geometric cue that gives quantitative (metric) depth rather than just ordinal ("which is closer") information. Thus, two eyes make depth perception more precise and robust.

Q3: What is binocular disparity?

Question: 3) What is binocular disparity?

Answer: Because the two eyes are horizontally separated, each eye sees a slightly different image of the world. This positional difference in where an object falls on the left and right retinas is called binocular disparity (or horizontal disparity). If an object projects to the same retinal position in both eyes, it has zero disparity (it lies on the horopter). If it projects to different positions, it has non-zero disparity, indicating that it is either nearer (positive disparity) or farther (negative disparity) than the fixation point. The brain uses these disparities to reconstruct depth in 3D space.

Q4: Define the horopter

Question: 4) What do we refer to as the horopter?

Answer: The geometric horopter is an imaginary curved surface in space where all points produce zero disparity—they project to corresponding retinal locations relative to the fovea in both eyes. Points on the horopter appear at the same depth as the fixation point. Points in front of it produce positive (near) disparity. Points behind it produce negative (far) disparity. The horopter defines the reference for how the brain measures relative depth.
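
A hedged numerical sketch of the underlying geometry: the disparity of a point straight ahead is the difference between the vergence angles of that point and of the fixation point, and for small depth offsets it is approximately I·Δd/D² (interocular distance I, fixation distance D, depth offset Δd). The numbers below are arbitrary.

    import numpy as np

    iod = 0.065        # interocular distance I (m)
    D = 1.0            # fixation distance straight ahead (m)
    dd = 0.02          # target is 2 cm nearer than fixation

    def vergence_angle(distance):
        """Angle between the two lines of sight for a point straight ahead at `distance`."""
        return 2 * np.arctan((iod / 2) / distance)

    # Disparity = difference in vergence angle between target and fixation point.
    disparity_exact = vergence_angle(D - dd) - vergence_angle(D)
    disparity_approx = iod * dd / D**2          # small-offset approximation

    print("exact disparity (rad): ", disparity_exact)
    print("approx I*dd/D^2 (rad): ", disparity_approx)

The nearer point yields the larger vergence angle, so the value comes out positive, consistent with the near-equals-positive convention used above.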

Q5: Wiring for disparity tuning

Question: 5) How are neurons “wired up” in order to be tuned for binocular disparity?

Answer: In V1, neurons receive input from both eyes after signals from the left and right LGN converge beyond layer 4C (where they are still monocular). Some V1 neurons fire maximally when the images from both eyes fall on corresponding retinal locations (zero disparity). Others are tuned for slightly offset inputs, firing more strongly when one eye’s image is shifted relative to the other—i.e., when an object lies closer or farther than the fixation point. Thus, V1 neurons act as disparity detectors, encoding depth through selective convergence of inputs from the two eyes.

Q6: How stereoscopes and 3D movies work

Question: 6) How does a stereoscope work? How about a modern 3D movie?

Answer: Both rely on presenting slightly different images to each eye to mimic binocular disparity:

  • Stereoscope: Two images taken from slightly different viewpoints are shown simultaneously, one to each eye. Each eye sees only its assigned image, and the brain fuses them into a single 3D percept. → This is a position-based (simultaneous) system.
  • Modern 3D movies:
    • Anaglyph (bi-color) glasses: Each lens filters light (e.g., red vs. green), so each eye receives one of two color-coded images projected with small horizontal offsets.
    • Shutter glasses: The screen rapidly alternates the left and right images while the glasses’ lenses alternately block each eye in sync with the display refresh rate.
    → These are time-based or color-based systems that exploit the brain’s stereo fusion mechanism to simulate depth.

Q7: MT tuned to motion at different depths

Question: 7) Is it surprising that neurons in area MT are tuned to stimulus motion at different depths (relative to fixation)? Discuss this within the overall organization principle of the visual cortex.

Answer: It is not surprising. Area MT sits along the dorsal visual stream (“where” pathway), which integrates information about motion, depth, and spatial relationships. Since MT neurons pool input from disparity-tuned V1 neurons, they naturally combine motion and depth cues, becoming tuned to 3D motion (motion through depth). This fits the hierarchical organization principle: V1 extracts basic features like orientation and disparity; MT integrates those signals over larger receptive fields to represent motion in depth; MST then adds even higher-order invariances like self-motion and optic flow. So MT’s sensitivity to both motion and disparity is a direct consequence of how the visual system is organized.

Q8: Causal evidence MT determines perceived depth

Question: 8) How did researchers demonstrate that neurons in area MT are likely to causally determine perceived depth of (moving) visual stimuli?

Answer: Researchers used the same four causal criteria applied earlier for motion perception to show that MT activity directly determines depth perception:

  • Removal/Inactivation: Temporarily deactivating MT with muscimol (a GABA agonist) caused severe depth-discrimination deficits, while contrast sensitivity remained normal—proving MT’s necessity.
  • Neural–behavioral sensitivity match: The neurometric curve (based on single MT neurons’ responses to binocular correlation) matched the monkey’s psychometric curve (behavioral performance). The ratio of neural to behavioral thresholds was close to 1, meaning MT neurons were as sensitive as the perception itself.
  • Artificial activation (microstimulation): Stimulating MT neurons tuned to a specific disparity biased the animal’s depth judgments toward that preferred depth—producing a direct causal shift in perception.
  • Trial-by-trial correlation (choice probability): When the stimulus was ambiguous (0% correlation), fluctuations in MT firing predicted the monkey’s choice of near vs. far. A choice probability > 0.5 indicated that trial-to-trial neural variability drives perceptual variability.

Together, these experiments provided strong causal evidence that MT neurons govern how the brain perceives motion and depth in 3D.


Decision Making

Q1: Response field vs receptive field in LIP

Question: 1) Why do we refer to the area in the visual field that an LIP neuron will respond to as its “Response field” rather than its “Receptive field”?

Answer: A receptive field refers to the region of visual space where a stimulus directly activates a neuron through sensory input—for example, a V1 or MT neuron responding to a moving bar or dots in its visual field. In contrast, LIP (lateral intraparietal area) neurons do not respond to visual features themselves. Instead, they respond when that location in space is relevant for behavior, such as being the target of a planned eye movement or decision choice. Because their activity reflects decision- or action-related relevance rather than direct sensory stimulation, the area they correspond to is called a Response Field (RF) rather than a receptive field.

Q2: LIP activity beyond saccade generation

Question: 2) What response characteristic reveals that LIP neurons are more than guiding/reflecting saccadic eye movements?

Answer: LIP neurons do not just fire when an eye movement is executed—they show gradual, ramping activity before the saccade, while the decision is still forming. In the random-dot motion discrimination task, their firing rate ramps up proportionally to accumulated sensory evidence supporting the choice that falls within their response field. The ramp slope is steeper when the motion signal is strong (easy decision). The activity reaches a threshold at the time of the decision, not necessarily at movement onset. This shows that LIP neurons are part of the decision-making process, not just motor preparation—they represent accumulated evidence for a choice.

Q3: Demonstrating evidence accumulation in humans

Question: 3) How can we easily demonstrate (experimentally) that humans accumulate evidence in a perceptual decision task?

Answer: Use a random-dot motion discrimination experiment in which dot coherence and viewing time are manipulated. When participants are given longer viewing times or stronger motion coherence, accuracy increases (and, in free-response versions of the task, reaction times shorten). Plotting accuracy against viewing time yields a pattern that fits a drift-diffusion model (DDM)—evidence gradually accumulates until a decision boundary is reached. → This demonstrates that humans (and animals) make perceptual decisions by accumulating noisy sensory evidence over time rather than making an instant judgment.

Q4: Speed–accuracy trade-off

Question: 4) What do we mean by “speed–accuracy trade-off”?

Answer: The speed–accuracy trade-off describes the relationship between how quickly and how accurately a decision is made: if a person responds quickly, they make decisions before much evidence has been gathered—increasing errors (fast but risky). If they wait longer, they accumulate more evidence—leading to higher accuracy but slower responses. In drift-diffusion terms, this corresponds to changing the decision boundary height: lower boundary = faster decisions, more errors; higher boundary = slower decisions, more accuracy.

Q5: DDM explains faster and more correct decisions with more evidence

Question: 5) Use the drift-diffusion model to explain why humans usually make faster and more correct decisions when having access to lots of sensory evidence compared to when they have little information.

Answer: In the DDM, the drift rate represents the strength of sensory evidence. A high drift rate means evidence accumulates quickly toward one boundary (decision). When sensory input is strong (e.g., high dot coherence): the drift rate is large, so the decision variable reaches the boundary faster → shorter reaction time. Because the accumulated evidence is more reliable, accuracy is higher. When evidence is weak (low coherence), the drift rate is shallow—accumulation is slower, more variable, and often hits the wrong boundary → slower and less accurate decisions. So, strong evidence both accelerates and improves decision-making.
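
A hedged simulation sketch of this account with arbitrary parameters: a higher drift rate produces both faster and more accurate decisions, and raising the bound would trade speed for accuracy (see Q4):

    import numpy as np

    rng = np.random.default_rng(6)

    def simulate_ddm(drift, bound=1.0, noise_sd=1.0, dt=0.001, max_t=5.0, n_trials=500):
        """Accumulate noisy evidence until it hits +bound (correct) or -bound (error)."""
        n_steps = int(max_t / dt)
        rts, correct = [], []
        for _ in range(n_trials):
            x = 0.0
            for step in range(n_steps):
                x += drift * dt + noise_sd * np.sqrt(dt) * rng.standard_normal()
                if abs(x) >= bound:
                    rts.append((step + 1) * dt)
                    correct.append(x > 0)
                    break
        return np.mean(rts), np.mean(correct)

    for drift, label in [(0.5, "weak evidence  "), (3.0, "strong evidence")]:
        rt, acc = simulate_ddm(drift)
        print(f"{label}: mean decision time = {rt:.2f} s, accuracy = {acc:.2f}")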

Q6: LIP response properties relevant to decisions

Question: 6) What are the characteristic response properties of LIP neurons that suggest that these neurons play an important role in the decision process of the motion discrimination task?

Answer: LIP neurons show ramping activity that mirrors evidence accumulation during decision-making:

  • Their firing rate increases gradually as sensory evidence builds up in favor of the choice within their response field.
  • The rate of ramping depends on motion strength—faster for strong signals, slower for weak ones.
  • Their firing reaches a consistent threshold at the time of decision commitment, regardless of task difficulty.
  • When motion is ambiguous, trial-by-trial variability in their firing correlates with behavioral choice (similar to MT–choice probability logic).

These features mirror the integrator in the DDM, making LIP a key node in transforming MT sensory input → decision signal → motor plan.

Q7: What is learned in task training?

Question: 7) What is “learned” when monkeys learn the motion discrimination task?

Answer: Monkeys don’t necessarily learn to change the sensitivity of MT neurons (which encode sensory evidence). Instead, learning strengthens the readout or weighting of sensory input by decision-making areas (like LIP): with training, LIP neurons become better at integrating MT activity over time to form a stable decision variable. MT neurons’ tuning to motion direction remains largely unchanged, but the LIP–MT connection becomes more efficient and predictive of the animal’s choice. In other words: learning improves the link between sensory evidence and decision formation, not the raw sensory representation itself.


Object Recognition

Q1: Latency to IT

Question: 1) How long does it approximately take for the visual information to arrive in area IT?

Answer: Visual information reaches inferior temporal (IT) cortex within roughly 100 milliseconds of stimulus onset (typical response latencies fall in the ≈50–100 ms range). This rapid feedforward processing shows how efficiently the ventral pathway can extract complex object information from raw retinal input.

Q2: Major cortical stages to IT

Question: 2) What are the major cortical areas the information passes on its way to IT?

Answer: Information flows along the ventral (“what”) pathway:

  • Retina → LGN → V1 → V2 → V4 → IT → parahippocampal & prefrontal areas

Each stage increases feature complexity and receptive field size:

  • V1: edges, orientations, contrast
  • V2: combinations of edges, textures, border ownership
  • V4: curvature, color, intermediate shape parts
  • IT: complete object and category representations (faces, tools, animals)

Q3: Two extreme hypotheses for IT representation

Question: 3) What are the two extreme hypotheses for how object information is represented in the population of neurons in area IT?

Answer:

  • Sparse / Grandmother Cell Hypothesis: Each object is represented by one highly selective neuron (e.g., a “Jennifer Aniston neuron”). → Simple but unrealistic because it would require an enormous number of neurons.
  • Distributed Population Code Hypothesis: Every object is represented by a pattern of activity across many neurons, where each neuron participates in representing multiple objects. → More robust, efficient, and biologically plausible.

Q4: How is object information actually represented in IT?

Question: 4) With regard to the above hypotheses: how is object information actually represented in area IT?

Answer: Evidence supports the distributed population code. Individual IT neurons are selective, but not uniquely tuned to one object. Each neuron contributes partially to many object representations. The brain decodes object identity from the population activity pattern, not from single neurons. Thus, IT represents global object structure through tolerant, distributed coding—enabling recognition even when the object changes in position, size, or viewpoint.

Q5: Differences between V1 and V2

Question: 5) Neurons in area V2 have in many regards similar tuning properties as neurons in V1; where do they differ though?

Answer: V1 neurons are tuned to simple, local features like orientation, edges, and spatial frequency. V2 neurons integrate these features and are more selective for combinations such as textures, contours, and border ownership. They cluster into regions specialized for different visual features, making V2 a bridge between low-level and mid-level processing.

Q6: Feature represented in V4

Question: 6) What is the stimulus feature that seems to be specifically and explicitly represented in neurons in area V4?

Answer: Area V4 neurons are tuned to curvature, color, and shape subparts. They respond selectively to specific curvatures or contour combinations, forming building blocks for more complex shapes. V4 is where shape fragments and color features begin to combine into recognizable parts of objects.

Q7: Representational similarity

Question: 7) What do we mean by “representational similarity”?

Answer: Representational similarity refers to how similar or different the neural activity patterns are for different stimuli. If two stimuli evoke similar neural patterns, they are perceived as similar in category or appearance. In IT, representations of objects from the same category (e.g., different faces) are close in neural space, while different categories (e.g., faces vs chairs) are far apart. It’s a way to describe how the brain organizes visual information based on relationships between objects.
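
A minimal sketch of how representational similarity is usually quantified: correlate the population response patterns evoked by pairs of stimuli. The data below are synthetic; a real analysis would use measured responses:

    import numpy as np

    rng = np.random.default_rng(7)

    n_neurons = 200
    face_prototype = rng.standard_normal(n_neurons)
    chair_prototype = rng.standard_normal(n_neurons)

    # Two faces = small variations on the face pattern; one chair = a different pattern.
    face_a = face_prototype + 0.3 * rng.standard_normal(n_neurons)
    face_b = face_prototype + 0.3 * rng.standard_normal(n_neurons)
    chair = chair_prototype + 0.3 * rng.standard_normal(n_neurons)

    def similarity(p, q):
        """Pearson correlation between two population response patterns."""
        return np.corrcoef(p, q)[0, 1]

    print("face A vs face B:", round(similarity(face_a, face_b), 2))   # high (same category)
    print("face A vs chair :", round(similarity(face_a, chair), 2))    # near zero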

Q8: Invariance and selectivity along the pathway

Question: 8) Invariance and selectivity increase along the visual pathway; what does this mean, and why does this make sense for object recognition?

Answer: Selectivity = neurons become more specific to complex features (from edges → shapes → whole objects). Invariance (tolerance) = neurons respond similarly even when the object changes in position, size, rotation, or background. As we move from V1 → V2 → V4 → IT, neurons become both more selective and more tolerant. This makes sense because the goal of the ventral stream is object recognition, not exact pixel matching. We need to recognize an object as the same despite variations in viewpoint or lighting—hence, higher areas emphasize identity over appearance.

Q9: Untangling representations in IT vs V4

Question: 9) In what way can we think of the neural representation in area IT to be more untangled than in area V4?

Answer: “Untangling” means that the neural representations of different objects become more linearly separable—easier to distinguish using a simple decision boundary. In V4, object manifolds (the sets of all possible transformations of an object) are tangled—overlapping in neural space. In IT, the same object’s transformations (position, scale, pose) cluster closely together, while different objects are far apart. → The IT representation is more untangled, allowing simple linear classifiers (like downstream neurons) to separate categories efficiently.

Q10: Alternating operations to untangle representations

Question: 10) What are the two functional operations that are performed alternately and repeatedly by the different neural populations of the ventral stream that ultimately lead to untangled neural object representations in IT cortex?

Answer: The two operations are:

  • Feature Combination (Integration): Combining simpler visual elements (edges, colors, curves) into more complex shapes and objects.
  • Pooling (Invariance): Generalizing across variations (position, size, rotation) to build tolerance.

These alternating operations—integration → pooling → integration → pooling—progressively transform raw visual input into untangled, invariant object representations in IT.
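
A hedged miniature of these two operations in the spirit of HMAX/CNN-style models (an analogy, not the brain's actual circuitry): a combination stage takes a weighted sum of simpler features over a local window, and a pooling stage takes a max over neighboring positions to gain tolerance to shifts.

    import numpy as np

    def combine(inputs, weights):
        """Feature combination: weighted sum of simpler features within a local window."""
        window = len(weights)
        return np.array([np.dot(inputs[i:i + window], weights)
                         for i in range(len(inputs) - window + 1)])

    def pool(responses, pool_size=2):
        """Pooling: max over neighboring positions -> tolerance to small shifts."""
        return np.array([responses[i:i + pool_size].max()
                         for i in range(0, len(responses) - pool_size + 1, pool_size)])

    # Toy 1-D "image" containing a small pattern at one position.
    signal = np.array([0, 0, 1.0, 2.0, 1.0, 0, 0, 0])
    shifted = np.roll(signal, 1)             # the same pattern, shifted by one position

    template = np.array([1.0, 2.0, 1.0])     # detector for that pattern

    for name, img in [("original", signal), ("shifted ", shifted)]:
        out = pool(combine(img, template))
        print(name, "pooled responses:", out)   # the peak stays in the same pooled bin

Stacking several such combine-then-pool stages makes the responses increasingly selective for complex configurations while increasingly tolerant to where (and how) they appear, which is the untangling described in Q9.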

Q11: Prosopagnosia and face representation

Question: 11) What is “Prosopagnosia”, and what does it tell us about how faces are represented in IT cortex?

Answer: Prosopagnosia is the inability to recognize familiar faces despite normal vision and intelligence. It often results from damage to the Fusiform Face Area (FFA) in the inferior temporal cortex. This shows that the IT cortex contains specialized subregions (like the FFA) that are highly tuned for face identity—though faces are still represented via population coding rather than single “face neurons.” Thus, IT encodes specific object categories (faces, places, bodies) within specialized but distributed networks.