Clinical and Research Limitations in the Use of Phallometric Testing with Sexual Offenders

William L. Marshall

[Sexual Offender Treatment, Volume 1 (2006), Issue 1]


In many settings, particularly in North America, phallometric evaluations of sexual arousal are routinely conducted with sexual offenders, and these evaluation procedures also serve as research instruments. There are, however, problems with the psychometric bases of these assessments, and studies reporting their use have so many idiosyncratic features that comparisons are of dubious value. Evidence concerning the reliability and criterion validity of phallometric testing leaves a lot to be desired, although the research has suggested a limited value in predicting subsequent recidivism. On the basis of these observations, the routine use of phallometric assessments as part of the evaluation of sexual offenders is not recommended.

Key Words: Phallometry, sexual offenders, validity, reliability, limitations


Phallometric testing is a procedure for determining the sexual preferences of males by measuring penile erection responses to stimuli depicting various sexual behaviors that may or may not involve various types of partners. Preferences for almost any variation on sexual activities have been assessed by this technology. It has been used to determine, for example, sexual preferences for forced or consenting sex among university students who admit or deny a proclivity to rape (Malamuth & Check, 1983). Much of this type of research with nonoffenders is aimed at clarifying the role of various factors in instilling a propensity to be forceful in a sexual context (Malamuth, 1984).

With identified sexual offenders, phallometric assessments have played a role in evaluating theories of sexual deviance, in determining treatment needs, in evaluating treatment effects, and in estimating risk to reoffend. Since sexual offending is an unfortunately high frequency crime in our societies causing psychological damage to many innocent people (Haugaard & Reppucci, 1988; Russell, 1984), any measurement procedure that assists in making crucial decisions about identified offenders, or helps us to understand these men, will obviously be valuable. At the same time, however, decisions about sexual offenders made, at least partly, on the basis of phallometric results can have very serious implications for the offenders. To address both the need to adequately evaluate sexual offenders, and to properly protect the rights of these men, phallometric testing must be shown to meet at least reasonably adequate psychometric standards. The present review will attempt to clarify the gaps in our knowledge about phallometric testing with sexual offenders, and point to what is needed to increase the adequacy of the empirical bases of phallometry with these men. While the present review addresses research only with sexual offenders, the results of such a review should also be seen to be relevant to the use of phallometric testing with any population.

Phallometry with Sexual Offenders

Phallometry was first developed for use with sexual offenders by Kurt Freund (1957) in Czechoslovakia. He developed a device that measured volume changes in the penis in response to various sexual and nonsexual stimuli. The early versions of the volumetric device were cumbersome, expensive, and tended to break down rather frequently. This prompted Bancroft, Jones and Pullan (1966), and Barlow, Becker, Leitenberg and Agras (1970), to develop simpler alternatives measuring changes only in the circumference of the penis. The volumetric measure, which appears to be the more sensitive of the two devices (Freund, Langevin & Barlow, 1974; McConaghy, 1974), describes changes in all aspects of the penis (i.e., length and circumference changes), whereas the circumferential measures describe only one aspect of these volume changes. Earls and Marshall (1982) demonstrated that important information about erectile responses may be lost when only circumferential changes are described, and McConaghy (1989) has made a case for distinguishing data generated by volumetric and circumferential procedures. McAnulty and Adams (1992), on the other hand, concluded from their consideration of the available data that there are more similarities between the products of these two devices than there are differences. Certainly the majority of reports in the literature describe the use of only the circumferential devices. For the purposes of this review, data from both procedures will be assumed to be essentially equivalent.

There have been several previous reviews of sexual preference testing with sexual offenders (Barbaree, 1990; Barker & Howell, 1992; Earls & Marshall, 1983; Murphy & Barbaree, 1994; O’Donohue & Letourneau, 1992; Simon & Schouten, 1992); however, there are problems with all of these reviews. With the exception of Murphy and Barbaree (1994), these reviews have been quite limited in scope (e.g., addressing only the court use of phallometric data, or focussing only on its use with child molesters). In addition, Murphy and Barbaree took a rather uncritical perspective on phallometric testing and did not satisfactorily point to areas that needed to be addressed.

The present review will consider some of the general problems with research using phallometry and some of the differences between studies that make comparisons difficult. Next, those studies bearing on the internal consistency, test-retest reliability, and criterion validity of these procedures with sexual offenders will be examined. All attempts will be made to identify problems, and offer suggestions for future research that may help to resolve these problems. Finally, the present limitations to the clinical use of phallometry with sexual offenders will be noted. First, however, it is necessary to define the population of sexual offenders whose responses will be considered and point to the heterogeneity of their characteristics.

Populations Studied

The definition of sexual offenders, for the purpose of this review, will be limited to mature males who either coerce an adult female to have sex with them, or have sex with a child, or expose their penis to unwilling females. There are, of course, other types of sexually offensive acts (e.g., voyeurism, frottage, bestiality, necrophilia) and other types of sexual offenders (e.g., women and juveniles), but phallometry has had either limited or no application to these populations. Some researchers and clinicians have used phallometry to evaluate male juveniles who have sexually offended.

Even within the limited types of sexual offenders under consideration in this review, there is such evident heterogeneity on almost all characteristics that have been measured (Marshall & Fernandez, 2003a) that it would be unreasonable to expect phallometry to accurately identify all such offenders. Sexual offenders also vary in terms of the type and number of victims they have abused and in the frequency of assaults on the same victim; these differences may affect the outcome of phallometric tests. For example, incest offenders characteristically have fewer victims than nonfamilial child molesters, but they often molest the same child repeatedly over many years. If experience with sexual molestation plays a role in creating or enhancing deviant preferences, then from what is known about stimulus generalization (Pearce, 1986), we would not expect incest offenders to display arousal to other unfamiliar children. Studies of stimulus generalization indicate that the broader the sample of a class of stimuli with which a person has had reinforcing experience, the broader the generalization gradient will be. Nonfamilial child molesters with numerous victims should, therefore, display arousal to unfamiliar children, which are just the stimuli presented in typical phallometric assessments. However, incest offenders should not display arousal to novel children since they have not sampled broadly enough. Indeed, the experiences of incest offenders should produce stimulus discrimination (Schwartz, 1984) resulting in a generalization gradient that is steep and narrow. Thus, incest offenders should generate erectile responses only to their own victims or to children remarkably similar to their own victims. At phallometric assessments using visual stimuli, then, incest offenders should display normative responding. Freund, Watson and Dickey (1991) report data that essentially confirm these expectations.

Early conditioning theories (e.g., Abel & Blanchard, 1974; McGuire, Carlisle & Young, 1965) appeared to suggest that all sexual offenders would display conditioned arousal to their deviant acts. Contrary to these expectations, most current researchers appear to expect only a limited sample of sexual offenders to display deviant responses at phallometric testing. For example, some researchers (e.g., Freund & Blanchard, 1989) claim that among child molesters only pedophiles will show deviant arousal. Presumably only some rapists are expected to display deviant responses. While this is a reasonable point of view, it does make it difficult to state in advance what proportion of, and who among, any specific type of sexual offender should produce deviant responding. This significantly confuses the issue of establishing the criterion validity of phallometric assessments.

While a significant number of child molesters may be pedophilic, there are problems in the application of the diagnostic criterion (Marshall, 1997). In addition, phallometry is often used as one of the procedures for determining pedophilia (see Freund & Blanchard, 1989) making the claims of Freund (1967b) appear circular; that is, only those child molesters who display phallometric arousal to children are pedophiles and pedophilia is diagnosed by phallometric responses. If we could reliably diagnose pedophilia independently of phallometry, then we could test the possibility that these child molesters are the only ones to display deviant arousal at assessment. Unfortunately, DSM-IV is not at all helpful with rapists, except sadists, and these sadists may be the only rapists to appear deviant at phallometric assessments. As we will see, the research reported to date on rapists can be interpreted to support this expectation.

Perhaps the most significant problem with phallometric evaluations of sexual offenders is the fact that the assessments are not ecologically sound. Blader and Marshall (1989) pointed out, for example, that when a rape occurs the man engages in aggressive behavior at the same time as he becomes sexually aroused. Phallometric evaluations of rapists may depict aggressive acts in the context of sexual acts, but that does not create aggressive responses in the subject being tested. Blader and Marshall argued that the essence of rape is the concurrent evocation of aggressive and sexual states in the offender, and that it may be that in nonoffenders these states are incompatible. Certainly we know that many sexual offenders are in particular states at the time of offending (e.g., many are angry or intoxicated), and yet these states are not present during phallometric testing. When we (Yates, Barbaree & Marshall, 1984) made normal males angry at a woman and then measured their erectile responses, they showed significant increases in arousal to rape. Similarly, when we intoxicated normal males, their arousal patterns to consenting sex and rape changed in the direction of a preference for rape (Barbaree, Marshall, Yates & Lightfoot, 1983). These studies suggest that if phallometric evaluations are to have any value they need to attempt to replicate the conditions (internal and external) that prevail when sexual offences occur. However, intoxicating or angering sexual offenders would not seem to be a wise strategy and would likely not be tolerated within most clinical settings.

Psychometric Standards

To properly evaluate the psychometric adequacy of phallometric measures, it is necessary to identify the standards against which these procedures will be assessed. For any test to be of value it must be shown to be reliable and valid (Kline, 1993), although the standards for acceptable levels of reliability and validity vary according to a variety of factors. Just what should be counted as satisfactory levels for the phallometric test is not clear, but, for example, values above r = .6 for test-retest reliability are said to be the minimal level when the issue of concern has trivial implications (Hair, Anderson, Tatham & Black, 1998; Murphy & Davidshofer, 1988). For decisions that have important implications for the person being tested, or for public safety, a value of r = .9 is considered the standard. For criterion validity, phallometric results demonstrating differences between criterion groups should be reasonably consistently replicated. For predictive validity it should be shown that scores on the test bear a consistent relationship with subsequent behavior.

In order to achieve acceptable standards of reliability and validity, it is necessary to have a test that is at least reasonably standardized. There have been repeated calls for the standardization of phallometric tests (Barker & Howell, 1992; Howes, 1995) but to date this has not been done. Replication of results would be somewhat surprising if the measures used in different studies markedly differed. While there is similarity in the overall features of the various approaches to phallometry (e.g., they all measure erectile responses to various, more or less agreed upon, categories of stimuli), across centres there are numerous variations and it is difficult to determine their influence. Certainly, the present lack of standardization presents a problem when comparing studies (Murphy & Barbaree, 1994).

The Problem of Faking

Perhaps the main threat to phallometric assessments concerns the fact that the ability to fake sexual response patterns has been demonstrated for both sexual offenders and nonoffenders (Murphy & Barbaree, 1994). Several studies have shown that normal subjects can significantly inhibit their arousal by using mental activities to distract themselves, despite a clear indication that they were attending to the stimuli (Henson & Rubin, 1971; Laws & Rubin, 1969). Similarly, numerous studies have shown that rapists and child molesters (Avery-Clark & Laws, 1984; Hall, 1989; Hall, Proctor & Nelson, 1988; Laws & Holmen, 1978; Quinsey, Steinman, Bergersen & Holmes, 1975; Wydra, Marshall, Earls & Barbaree, 1983) are able to both inhibit arousal to preferred stimuli and generate arousal to nonpreferred stimuli.

Quinsey and Chaplin (1988), Malcolm, Davidson and Marshall (1985), and Marshall (2004) have all developed procedures aimed at preventing faking, and while their data offer encouragement, their procedures involve a quite different and more laborious approach than is usual in phallometric testing. Insofar as faking subjects use cognitive strategies evident only to themselves, it appears virtually impossible to prevent or detect dissimulation; thus, faking will always constitute some undetermined degree of threat to the validity of phallometric assessments.

Differences between Studies

Subject Differences

Although many sexual offenders deny having committed an offense, very few phallometric studies indicate whether or not they exclude deniers, even though Freund (Freund & Blanchard, 1989; Freund, Chan & Coulthard, 1979) has found that convicted child molesters who deny they have an interest in children typically show normal responses. Thus, if a study includes deniers, there may be no differences between the sexual offender group and normal controls, whereas excluding deniers may reveal differences.

In attempting to determine the distinctiveness of the responses of sexual offenders, researchers have typically compared them to men who are presumed to be nonoffenders. Unfortunately, since some normal males display an interest in deviant sex (Finkelhor, 1979; Langevin, 1988), it is not a simple task to identify a sample of men who are clearly not offenders. In our research we have comparison subjects indicate anonymously whether they have either fantasized about or committed a sexual offense. This typically leads to the exclusion of between 35% and 45% of the subjects. We could find no other phallometric studies that screened nonoffenders in this way.

Some studies employ quite small numbers and it is hard to know how representative they are, particularly because sexual offenders are quite heterogeneous. In addition, different researchers choose their samples from quite different settings (e.g., from prisons, psychiatric institutions, outpatient settings) and this seems certain to produce results that are not generalizable to offenders in other settings.

Stimulus Differences

The stimuli used in studies differ in terms of the chosen modality of presentation (i.e., film, videotapes, slides, audiotapes, or covert images), and within each modality the stimuli vary along several dimensions (e.g., black-and-white vs. color, presence or absence of background, single vs. two or more figures). While Abel and Blanchard (1976) found that videos generated the greatest levels of arousal, it appears that such strong arousal obscures differential responding.

Stimulus duration also varies across studies, despite its importance, as demonstrated by Avery-Clark and Laws (1984). They showed that arousal did not reach maximal levels of discrimination until audio stimuli had been presented for 3 minutes. Of perhaps greater importance is the duration and temporal location of the sexually significant events in audiotaped stimuli. If the supposedly normative and deviant stimuli do not have the same sexually significant elements occurring in the same temporal location, and for the same duration, then the resultant arousal patterns may differ not because the stimuli depict normative or deviant events but simply because they depict differing sexual elements. Barbaree, Marshall and Lanthier (1979) generated stimulus sets that were matched for the temporal location and duration of the sexual elements, but they did not determine empirically what elements were sexually significant. This may be a problem because Abel, Barlow, Blanchard and Mavissakalian (1975) showed that different subjects responded differentially to the various elements in their sexual stimuli, as did Laws (1984). Of course, if all sexual offenders prove to be this idiosyncratically aroused, then producing a standardized test that identifies deviance may be impossible.

The fact is there have been no empirical determinations of what constitutes the appropriate content of stimuli for preference testing. Since Lalumière and Quinsey (1994) claim that different findings across studies with rapists are due to differing stimulus content, this is potentially a very important issue.

Response Differences

Phallometric responses may be reported as raw scores, percentages of full erection, z scores, or as ratios of responses to deviant and appropriate images (Barbaree, 1990), and there is debate about the advantages and disadvantages of these various methods (Barbaree & Mewhort, 1994; Earls, Quinsey & Castonguay, 1987). Furthermore, each scoring system can be derived from either average or peak responses to each stimulus set, or from latency to respond. Presently there is insufficient data to guide us on these alternatives but a report by Murphy (1987) is revealing. He noted that the results of one of his earlier studies (Murphy, Haynes, Stalgaitis & Flanagan, 1986) disagreed with the observations of a similar study by Marshall et al. (1986). Murphy pointed out that Marshall et al. entered peak responses into their data analyses, whereas he had entered average responses. When Murphy converted his earlier data to peak responses, the differences between the two studies disappeared.
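The difference between peak and average scoring can be illustrated with a minimal sketch. The traces below are invented for illustration (arbitrary units of change from baseline), and the function names are hypothetical rather than taken from any of the studies cited; the point is only that the same two recordings can be ranked in opposite orders depending on which summary is entered into the analysis.

```python
from statistics import mean


def peak_and_average(trace, baseline=0.0):
    """Summarize one response trace as (peak, average) change from baseline."""
    changes = [value - baseline for value in trace]
    return max(changes), mean(changes)


def percent_full_erection(change, full_erection_change):
    """Express a change score as a percentage of the subject's full erection."""
    if full_erection_change <= 0:
        raise ValueError("full_erection_change must be positive")
    return 100.0 * change / full_erection_change


# Hypothetical traces: a brief spike versus a moderate but sustained response.
brief_spike = [0.0, 2.0, 10.0, 2.0, 0.0]
sustained = [4.0, 5.0, 5.0, 5.0, 4.0]

spike_peak, spike_avg = peak_and_average(brief_spike)
sustained_peak, sustained_avg = peak_and_average(sustained)

# Peak scoring ranks the brief spike higher; average scoring ranks it lower.
print(spike_peak > sustained_peak, spike_avg > sustained_avg)
```

With these invented values the two summaries disagree about which stimulus produced the larger response, which is the kind of discrepancy that could make otherwise similar studies appear to conflict.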

There is also the problem of deciding what constitutes a response of sufficient magnitude to declare it to be greater than random fluctuations and therefore meaningful (Laws & Osborn, 1983). Some authors (Barbaree et al., 1979; Laws & Osborn, 1983) exclude from analyses those subjects whose responses are below an arbitrary criterion, but Harris, Rice, Quinsey, Chaplin and Earls (1992) found that including low response subjects did not affect the sensitivity of their phallometric assessments. Barbaree and Mewhort (1994), however, demonstrated that z score transformations can be distorted by the inclusion of low responders. What to do with low responders and how to appropriately represent phallometric data (i.e., z scores, percent full erection, etc.) are two frequently confounded issues upon which there is presently no agreement.
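One way in which low responders could distort z scores can be sketched as follows. The sketch assumes that z scores are computed within each subject, across that subject's responses to the stimulus categories; this is an assumption made for illustration and is not necessarily the transformation examined in the studies cited above. The data and the function name are hypothetical.

```python
from statistics import mean, pstdev


def within_subject_z(responses):
    """Standardize one subject's responses against his own mean and SD."""
    m = mean(responses)
    sd = pstdev(responses)
    if sd == 0:
        raise ValueError("responses have zero variance")
    return [(r - m) / sd for r in responses]


# Hypothetical responses to three stimulus categories (arbitrary units).
low_responder = [0.1, 0.3, 0.2]     # fluctuations that may be mere noise
high_responder = [5.0, 15.0, 10.0]  # substantial, differentiated arousal

z_low = within_subject_z(low_responder)
z_high = within_subject_z(high_responder)

# After standardization the two profiles are numerically indistinguishable.
print([round(z, 3) for z in z_low])
print([round(z, 3) for z in z_high])
```

Under this assumption, trivially small fluctuations are stretched to the same scale as a clearly differentiated profile, so a low responder contributes an apparent "preference" that the raw data do not support.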


Reliability

Given the extensive use of erectile measures of sexual preferences with sexual offenders (Howes, 1995; Knopp, 1984), and the rather long time over which such measures have been in use (Bancroft et al., 1966; Barlow et al., 1970; Freund, 1957), it is surprising that so few reliability studies have been reported.

While the internal consistency of the stimuli used in any test is a relevant issue, it is difficult to know how to address this with phallometry. For example, a stimulus set depicting adult females typically includes several different women in the hope that the respondent’s idiosyncratic preferences for women’s features (e.g., hair color and length, breast size) will appear in at least one of the stimuli. Given this basis for determining the stimulus set we would not expect high levels of internal consistency (i.e., how well each stimulus generates the same arousal as each other stimulus). Despite this expectation, indices of the internal consistency of phallometric stimuli have been reported to be satisfactory (Day, Miner, Sturgeon & Murphy, 1989; Quinsey & Chaplin, 1984; Wormith, 1986). It is, however, the test-retest reliability of phallometric assessments that is the critical issue. Test-retest reliability is an important issue for both diagnosis and assessment, but most particularly for evaluating the effects of treatment. One of the problems inherent in examining reliability over time concerns habituation, which has been found to occur both within and across assessment sessions (Eccles, Marshall & Barbaree, 1988; O’Donohue & Geer, 1985; O’Donohue & Plaud, 1991). However, Eccles et al. (1988) found that relative responding (in their case, to lesbian and heterosexual scenes) remained essentially the same. The best estimate of test-retest reliability, then, should employ a ratio measure to eliminate the problem of habituation.

Wormith’s (1986) study of responses to child stimuli examined data from two test sessions occurring one week apart. The low resultant coefficient (r = .53), however, was rendered even less compelling because Wormith collapsed over different offender types. Davidson and Malcolm (1985) examined the responses, over a 6-day interval, of 90 rapists. They reported marginally adequate reliability (r = .65) for the rape index (a ratio measure). Barbaree, Baxter and Marshall (1989), however, found far less encouraging data using the same stimuli and drawing subjects from the same setting. Barbaree et al. calculated reliability independently for 60 rapists and 40 nonoffenders over a one-week interval. They found unsatisfactory coefficients for both groups (r = .44 for rapists; r = .29 for nonrapists). Using a criterion of progressive exclusion, Barbaree et al. found that it was not until all those subjects with less than 75% full erection were excluded that reliability estimates reached acceptable levels (r = .74 for rapists; r = .79 for nonrapists). However, by this time 75% of the rapists and 56% of the nonrapists had been eliminated from the data set. Fernandez and Marshall (2004a) determined the test-retest reliability of the phallometric assessments of rapists and child molesters over an average between-test interval of 6 months. This test-retest interval is, of course, more in line with the gap between clinical evaluations done at pre- and post-treatment; Fernandez and Marshall’s study is, therefore, of more direct clinical relevance than the earlier, short-term test-retest reliability studies. Unfortunately, none of the indices calculated by Fernandez and Marshall approached minimal levels of test-retest reliability. This study, then, presents a very serious challenge to the empirical status of phallometric evaluations. If a test cannot be shown to be reliable, then the issue of its validity is moot, and its use should, accordingly, be abandoned.
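The logic of estimating test-retest reliability after excluding low responders can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reconstruction of any cited analysis: the subject records are invented, the field names (`max_pct`, `t1`, `t2`) are hypothetical, and the inclusive comparisons against the .6 and .9 benchmarks mentioned earlier are a simplifying choice.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def retest_after_exclusion(subjects, min_percent):
    """Return (r, fraction excluded) keeping subjects at or above min_percent."""
    kept = [s for s in subjects if s["max_pct"] >= min_percent]
    r = pearson_r([s["t1"] for s in kept], [s["t2"] for s in kept])
    return r, 1 - len(kept) / len(subjects)


def meets_standard(r, high_stakes):
    """Compare r with the benchmarks cited in the text (.9 or .6)."""
    return r >= 0.9 if high_stakes else r >= 0.6


# Invented records: maximum % full erection and an index at test and retest.
subjects = [
    {"max_pct": 20, "t1": 0.9, "t2": 1.6},
    {"max_pct": 40, "t1": 1.4, "t2": 0.7},
    {"max_pct": 80, "t1": 0.5, "t2": 0.6},
    {"max_pct": 90, "t1": 1.0, "t2": 1.1},
    {"max_pct": 95, "t1": 1.5, "t2": 1.4},
]

r_kept, fraction_excluded = retest_after_exclusion(subjects, 75)
print(round(r_kept, 2), fraction_excluded)
```

The sketch makes the trade-off explicit: raising the exclusion criterion can raise the coefficient, but only by discarding a growing share of the sample, so a coefficient reported after exclusion describes the retained subjects rather than the population assessed.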

Criterion Validity

Despite the comments about reliability and their implications, validity issues will be appraised for the sake of a complete evaluation of phallometry. While various types of validity are important to the psychometric bases of phallometry, the concern here will only be with the capacity of these measures to distinguish sexual offenders from other individuals. For reviews of the literature bearing on other types of validity with phallometry, the reader is referred to Murphy and Barbaree (1994) and O’Donohue and Letourneau (1992). For convenience, research with the different types of sexual offenders will be considered separately.

Child Molesters

Freund (1965) was the first to report a comparison between child molesters and nonoffenders. He compared age preferences determined by volumetric responses and found that the men who had molested young girls displayed greater arousal to visual images of young children than they did to adults; the normal subjects showed greater arousal to adults than to children. In a series of subsequent studies, Freund essentially replicated these findings and extended them to male-victim child molesters (Freund, 1967a, 1967b; Freund & Blanchard, 1989; Freund et al., 1979; Freund, Scher, Chan & Ben Aron, 1982). Freund, however, failed to note in these studies that he had selected only those child molesters who admitted to having offended against multiple victims; he reported this much later (Freund, 1991).

Similar findings of group differences between nonfamilial child molesters and nonoffenders have been reported by numerous other researchers (Abel, Becker, Murphy & Flanagan, 1981; Baxter, Marshall, Barbaree, Davidson & Malcolm, 1984; Day et al., 1989; Frenzel & Lang, 1989; Grossman, Cavanaugh & Haywood, 1992; Lang, Black, Frenzel & Checkley, 1988; Marshall et al., 1986; Marshall, Barbaree & Butt, 1988; Murphy et al., 1986; Quinsey & Chaplin, 1988; Quinsey, Chaplin & Carrigan, 1979; Quinsey et al., 1975; Wormith, 1986). However, not all of these observed differences are as clear-cut as this summary suggests. For example, Baxter et al. (1984) reported that child molesters who had offended exclusively against post-pubescent females (ages 12-16 years) displayed preferences that matched those of rapists and were quite different from those of offenders who molested children under age 12 years. Malcolm et al. (1993) replicated these findings, and Hall et al. (1988) found that rapists were equally aroused by child stimuli as were child molesters. Furthermore, Marshall et al. (1988) demonstrated that men who molested male children had more complex sexual preferences than was expected. They found that two-thirds of these men were, in their responses to adults, clearly heterosexual; the remainder preferred adult males. The "heterosexual" offenders chose as victims boys who were clearly prepubescent (average age 7 years with no victims over age 11 years), while the "homosexuals" typically chose pubescent boys (average age 12.5 years). This had not been observed previously and needs replicating before too much can be made of it, but it does suggest a more complex picture than is implied by the simple assessment of age preferences.

Of perhaps greater significance is the observation that it is only those nonfamilial child molesters with multiple victims who display deviant responses at phallometric evaluations. Freund, Watson and Dickey (1991) showed that single victim child molesters essentially appeared normal at phallometric testing (i.e., they showed no arousal to children but heightened arousal to adults) whereas multiple victim child molesters displayed strong arousal to children and far less arousal to adults. Barbaree and Marshall (1985) similarly found that offenders with multiple victims responded with high arousal to children.

The majority of studies have found that incestuous offenders respond to adult and child stimuli in much the same way as do nonoffenders (Frenzel & Lang, 1989; Freund et al., 1991; Grossman et al., 1992; Marshall et al., 1986; Murphy et al., 1986; Quinsey et al., 1975, 1979), although two studies have reported greater arousal to children than to adults (Abel et al., 1981; Murphy et al., 1986). These latter two studies used audiotaped descriptions, whereas the reports of normative responding by incestuous offenders used visual slides. Illustrating the importance of this, Lang et al. (1988) found deviant responses among incestuous offenders to audiotapes but more normal responses to visual slides. As Murphy and Barbaree (1994) suggest, it may be that audiotapes allow incestuous offenders to imagine their own victims to whom they may be sexually attracted, whereas slides of unfamiliar children may preclude arousal. This possibility was confirmed by Fernandez and Marshall (2004b) who found that very few incest offenders were aroused by visual images of naked children whereas almost 70% of incest offenders were significantly aroused by audiotaped descriptions of an adult having sex with a child. These findings, and the corresponding finding by Fernandez and Marshall (2004b) that nonfamilial child molesters were equally aroused by visual and auditory stimuli, fit with the earlier comments in this paper about generalization and discrimination among child molesters. If incest offenders only show arousal to children when the stimuli (audiotaped descriptions) are sufficiently vague as to allow them to imagine their own specific victim, then the surprising thing is that only 70% respond deviantly. It is not at all surprising that incest offenders, who typically molest their victims repeatedly, are sexually aroused by imagining molesting their victims, so it is hard to see what phallometric appraisals add.

Similarly, if the only nonfamilial child molesters who appear deviant at phallometric testing are those with multiple victims, then again it is not clear what additional information is provided by phallometry. Men who molest several children are clearly sexually aroused by children; we do not need phallometric tests to know that.


Rapists

Phallometric studies of rapists typically measure responses to audiotaped descriptions of consenting sex and forceful nonconsenting sex. From these data a rape index may be calculated by dividing responses to rape by responses to consenting sex, or by subtracting arousal to consenting sex from arousal to rape.
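The two forms of the rape index can be written out as a short sketch. The function names and the response values are hypothetical, chosen only to show the arithmetic. On the ratio form, values above 1 indicate greater arousal to rape than to consenting sex; on the difference form, values above 0 indicate the same.

```python
def rape_index_ratio(rape_response, consenting_response):
    """Ratio form: response to rape divided by response to consenting sex."""
    if consenting_response == 0:
        raise ValueError("ratio is undefined when the consenting response is 0")
    return rape_response / consenting_response


def rape_index_difference(rape_response, consenting_response):
    """Difference form: response to rape minus response to consenting sex."""
    return rape_response - consenting_response


# Hypothetical responses (e.g., percent of full erection) at a first session.
first = {"rape": 20.0, "consenting": 40.0}
# Hypothetical second session in which both responses have halved.
second = {"rape": 10.0, "consenting": 20.0}

for session in (first, second):
    print(
        rape_index_ratio(session["rape"], session["consenting"]),
        rape_index_difference(session["rape"], session["consenting"]),
    )
```

In this invented example the ratio is the same at both sessions while the difference score shrinks, which shows why a ratio measure is less sensitive to a proportional decline in overall responding. The ratio form is, however, undefined when there is no response to consenting sex, so it cannot be computed for every subject.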

A series of early studies examining small samples of subjects consistently found that rapists differed from nonrapists in their erectile responses to these stimuli (Abel, Barlow, Blanchard & Guild, 1977; Abel, Blanchard, Becker & Djenderedjian, 1978; Barbaree et al., 1979; Quinsey & Chaplin, 1982, 1984; Quinsey, Chaplin & Upfold, 1984; Quinsey, Chaplin & Varney, 1981). However, even within this group of studies there were inconsistent findings. For example, Quinsey et al. (1984) found greater arousal to rape than to consenting sex among rapists; Abel et al. (1977, 1978) reported that rapists were equally aroused by rape and consenting sex; and Barbaree et al. (1979) found rapists to be less aroused to rape than to consenting sex, but their responses to rape were, nevertheless, greater than the responses of nonoffenders. These are, indeed, confusing results.

At least three subsequent studies (Earls & Proulx, 1986; Freund, Scher, Racansky, Campbell & Heasman, 1986; Rice, Chaplin, Harris & Coutts, 1994) have reported data consistent with the findings of Quinsey et al. (1984). However, examination of the graphs presented by Rice et al. (1994) reveals that the responses of rapists to the forced sex stimuli were very nearly identical to the responses shown by the nonoffenders; the responses of the rapists to consenting sex, on the other hand, were substantially lower than those displayed by controls. This suggests that the rapists in Rice et al.’s study were, in fact, inhibited by both the absence of force and by the consenting nature of the female. Barbaree (1990) reported the detailed analysis of a sadistic rapist’s responding that matched that of Rice et al.’s rapist group, suggesting that Rice et al. may have been examining a group of sadistic rapists. This is a point we (Blader & Marshall, 1989; Marshall & Eccles, 1991) have repeatedly made about the studies by Rice, Harris, Quinsey and their colleagues.

We (Blader & Marshall, 1989; Marshall & Eccles, 1991) have suggested that sadistic rapists should be expected to show a preference for forced sex, particularly where violence is involved in the depiction. Nonsadistic rapists, we suggested, might not display sexual arousal to rape and several studies with large numbers of subjects have, in fact, failed to discern differences between rapists and nonrapists (Baxter, Barbaree & Marshall, 1986; Baxter et al., 1984; Hall, 1989; Hall et al., 1988; Langevin, Paitich & Russon, 1985; Murphy, Krisak, Stalgaitis & Anderson, 1984; Wormith, Bradford, Pawlak, Borzecki & Zoher, 1988). As a result, we proposed that differences in the proportion of sadists across studies accounted for the observed inconsistent results with rapists.

Lalumière and Quinsey (1993, 1994) have countered our claim. Their meta-analysis (Lalumière & Quinsey, 1994), in particular, convinced them that the differences in reported findings were the result of stimulus differences and not sample differences. Support for this perspective is derived from two studies that manipulated stimulus content. Harris et al. (1992) found that the best discrimination between rapists and controls was obtained when the rape stimuli were made particularly brutal, and Proulx, McKibben and Coté (1994) reported that rapists differed from nonrapists only when the rape stimuli contained significant elements of humiliation. However, the subjects in both of these studies, and the majority of subjects in Lalumière and Quinsey’s meta-analysis, appear to have included a disproportionate number of sadists. When we (Eccles, Marshall & Barbaree, 1994) manipulated both brutal features and humiliation in our stimuli, we observed no differences between rapists held in a Canadian penitentiary (very few of whom were sadists) and nonrapists. Even when we employed the same stimuli previously used by Quinsey and his colleagues, we were unable to discern differences between our rapists and controls. The issue, then, is not as simple as Lalumière and Quinsey would have us believe. Furthermore, the conclusions of Lalumière and Quinsey’s (1994) meta-analysis are contradicted by a similar meta-analysis performed by Hall, Shondrick and Hirschman (1993), who concluded that the bulk of the research had failed to satisfactorily demonstrate differences between rapists and nonrapists. Finally, Thornton (1998) showed that only those rapists determined by actuarial indices to be at high risk to reoffend displayed deviant responses. He found that low and moderate risk rapists showed a preference for normative sex at phallometric assessment.

If only sadistic rapists, and those rapists at high risk to reoffend, display deviant arousal at phallometric evaluations, then it is hard to see what contribution to our understanding is made by the evaluations. We know in advance of phallometric testing that sadistic and high risk rapists have extensive problems, so it would seem that phallometrics adds nothing to what we already know.

The fact is there really are not enough studies available from a sufficiently broad range of settings using similar stimulus content to satisfactorily address this issue. So long as studies employ quite different stimuli, small subject samples, and do not describe in sufficient detail the characteristics of their subjects, we will not be in a position to define unequivocally the sexual preferences of rapists.


There have been several attempts to develop phallometric assessments specific to exhibitionists. However, these attempts have been quite limited and not always obviously appropriate. More to the point, the resultant data appear unconvincing.

Kolarsky, Madlafousek, and Novotna (1978), in the first reported study of the sexual preferences of exhibitionists, presented them and matched comparison subjects with movie clips of the same female in various seductive and nonseductive poses. The erectile responses revealed no differences between the two groups, but perhaps this is not surprising since there were apparently no instructions to imagine exposing to the women nor were any of the images suggestive of exposing. In a follow-up study (Kolarsky & Madlafousek, 1983), exhibitionists displayed greater arousal than normals to a film showing a fully dressed woman engaged in household tasks. It is not clear why these researchers thought exhibitionists would respond to these scenes since the scenes more closely match the sort of circumstances appropriate to voyeurs rather than exhibitionists. Kolarsky and Madlafousek (1983) also found that normal males were significantly aroused by a film showing a naked woman pointing at her genitals, but this arousal occurred only after these men had seen a prior scene of the same woman fully dressed and displaying erotic behavior. Exhibitionists, on the other hand, were aroused by the naked woman pointing at her genitals in the absence of any prior scene. Exactly what these results mean and why they should differentiate exhibitionists from normal males is not at all obvious. There has been no attempt by any independent group to replicate these findings.

Langevin et al. (1979) compared the erectile responses of 10 exhibitionists and 10 nonoffender subjects to audiotaped descriptions of various acts of exposing and intercourse; they found no group differences. Similarly, in a series of studies, Freund and his colleagues (Freund & Blanchard, 1986; Freund, Scher, & Hucker, 1983, 1984) found no differences in the erectile responses of exhibitionists and nonoffenders to scenes of exposing. Murphy, Abel, and Becker (1980) found essentially the same thing. The approximate average peak response by the 16 exhibitionists in Murphy et al.’s report was 35% full erection to exposing and 55% full erection to the nondeviant stimuli, revealing a clear preference for consenting intercourse among these men.

Maletzky (1980) is the only researcher to date to report clear evidence of deviant arousal among exhibitionists. He described two studies where exhibitionists (N = 20 and N = 30) produced full erections (apparently all of them did since the group means in both studies were 100% full erection) to scenes of exposing. These are surprisingly strong responses considering what is usually found in laboratory studies of men’s erectile preferences. In fact, this is the only study in the literature examining any group of subjects to report full erections in all subjects in response to any stimulus. Unfortunately, Maletzky did not compare his exhibitionists’ responses to exposing with their responses to nondeviant stimuli, or with the patterns of response among nonexhibitionist subjects.

In an attempt to design more relevant stimuli, Fedora, Reddon, and Yeudall (1986) generated scenes of women who were fully clothed and who appeared in various public places where exhibitionists might offend. Again, however, we are not told what instructions, if any, were given to the subjects. There is no reason at all to presume that the exhibitionists would automatically recognize the scenes as prompting exposure, but it seems certain that this would not occur to any of the nonexhibitionist subjects. Exhibitionists in Fedora et al.’s study showed greater arousal to these scenes than did normal subjects but their responses were lower to the exposing images than they were to images of naked females (either alone or engaged in sex acts). If relative arousal reveals preferences, then these exhibitionists did not display deviant responding. Indeed they were most aroused by consenting sex involving adults.

Marshall, Payne, Barbaree, and Eccles (1991) also attempted to produce ecologically relevant scenes in order to maximize the possibility of revealing deviant preferences among exhibitionists. Their audiotapes described either exposing or mutually consenting sexual intercourse occurring in three different places: the man’s apartment, his car, or a secluded park. These locations represented the typical places where the offenders in the study had exposed and they were also places where intercourse might take place. The 44 exhibitionists in Marshall et al.’s study displayed less arousal to the consenting sex scenes and greater arousal to exhibiting than did the 20 community subjects. However, the mean response of the offenders to exposing was 19.2% full erection whereas their average arousal to consenting sex was 42.5%; obviously as a group these exhibitionists did not display a deviant profile even though they showed greater arousal to all stimuli than did normals; both groups displayed profiles that indicated prosocial sexual interests.

Calculating an exhibitionist deviant quotient (EDQ) by dividing arousal to exposing by arousal to intercourse, Marshall et al. (1991) were able to identify only 13.6% of the exhibitionists as deviantly aroused using a conservative criterion of EDQ = 0.8 or higher. Using 10 or more victims as an index, Marshall et al. identified 24 of their exhibitionists as having a chronic history, and yet only 8 of these men met the criterion of displaying deviant arousal. Thus, 16 exhibitionists with an average of 26.9 victims each displayed no preference for exposing but were significantly aroused by consenting heterosexual intercourse. When Marshall et al. asked their exhibitionists what their sexual fantasies involved, they all reported imagining their exposure victims requesting, and then engaging in, intercourse with them. For these men, apparently, exposing is driven by the unlikely possibility that their victims will initiate intercourse with them rather than by a preference for the exposing behavior.
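
The EDQ calculation and its cut-off can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and applying the quotient to the group means reported above (19.2% to exposing, 42.5% to consenting sex) is an assumption made for the example, since Marshall et al. (1991) applied the criterion to individual subjects.

```python
EDQ_CRITERION = 0.8  # conservative criterion described in the text


def exhibitionist_deviant_quotient(exposing: float, intercourse: float) -> float:
    """Arousal to exposing divided by arousal to intercourse."""
    if intercourse <= 0:
        # The quotient is undefined without a measurable response to intercourse.
        raise ValueError("intercourse arousal must be greater than zero")
    return exposing / intercourse


def is_deviantly_aroused(exposing: float, intercourse: float) -> bool:
    """True when the EDQ is at or above the criterion."""
    return exhibitionist_deviant_quotient(exposing, intercourse) >= EDQ_CRITERION


# Group means from the text, used here only as illustrative inputs.
group_edq = exhibitionist_deviant_quotient(19.2, 42.5)  # roughly 0.45
group_deviant = is_deviantly_aroused(19.2, 42.5)        # False
```

On these inputs the quotient falls well short of 0.8, which is consistent with the observation that the group as a whole did not display a deviant profile.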

This review of the data on phallometric assessments of exhibitionists reveals little support for the idea that these assessment procedures produce meaningful evidence on these offenders. Except for a very small number of the most persistent exhibitionists, most of these offenders appear normal at phallometric evaluations.

Prediction of Reoffending

If phallometric responses describe critical features of sexual offenders, then they should bear some relationship to the subsequent incidence of reoffending. Quinsey, Chaplin and Carrigan (1980) were the first to report the relationship between phallometrically determined sexual preferences and later recidivism. They found a small but significant relationship between post-treatment deviant arousal and recidivism (measured over a 29-month follow-up) among 30 child molesters. Subsequently, Rice, Quinsey and Harris (1991) found that both pre-treatment and post-treatment deviant indices (a proportional index of deviant to appropriate arousal) were trivially related to long-term treatment outcome (r = -.16 and r = -.06 respectively, i.e., at most 2.6% of the variance in recidivism). Barbaree and Marshall (1988), however, found a somewhat stronger relationship between an index of deviant arousal (a ratio of responses to children versus adults) and recidivism (r = .38) among 35 untreated child molesters. The comprehensive prediction instrument described by Quinsey, Rice and Harris (1995) includes indices of deviant arousal which again, on their own, are weakly related to recidivism (r = -.21). However, in combination with 12 other features of the offenders the deviant indices contribute to a powerful prediction of reoffending.
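
The "variance in recidivism" figure quoted above follows from squaring the correlation coefficient. The short Python sketch below makes that arithmetic explicit for the correlations cited in this paragraph; the function name is hypothetical, and the calculation assumes the usual interpretation of r squared as the proportion of shared variance.

```python
def variance_explained_percent(r: float) -> float:
    """Percent of variance accounted for by a correlation of r (r squared times 100)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r * 100.0


# Correlations cited in the paragraph above.
cited = {
    "Rice et al. (1991), pre-treatment": -0.16,
    "Rice et al. (1991), post-treatment": -0.06,
    "Barbaree & Marshall (1988)": 0.38,
    "Quinsey et al. (1995)": -0.21,
}

for study, r in cited.items():
    print(f"{study}: r = {r:+.2f}, variance explained = {variance_explained_percent(r):.2f}%")
```

For r = -.16 this gives 2.56%, which matches the "at most 2.6%" figure, and for r = .38 it gives about 14.4%, so even the strongest of these single-index correlations leaves most of the variance in recidivism unexplained.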

Hanson and Bussière (1998) conducted a meta-analysis of 61 studies reporting sexual offender recidivism. A pedophile index derived from phallometric evaluation (i.e., responses to children divided by responses to adults) was the most powerful predictor of recidivism (r = .32). Interestingly, a similarly derived rape index proved to be inadequate in predicting reoffending (r = .05) despite the fact that the total sample included both rapists and child molesters.

The bulk of the data presently available suggests that phallometric measures, particularly of sexual interests in children, have promise as somewhat weak predictors of reoffending, although it is also clear that these indices function best as part of a more comprehensive prediction package (Hanson, 1997; Quinsey et al., 1995).



The most obvious implication of the above review is that descriptions of the various aspects of phallometric studies need to be made clear. We need to know the relevant details of the research subjects and how they were selected, the precise nature of the stimuli presented, and the procedures for monitoring and representing arousal. We also need to know what attempts were made to control for faking.

The lack of standardization of phallometric procedures remains, however, the single most glaring inadequacy in the research literature. Without a standardized procedure, standardized instructions, standardized stimuli, and evidence on the value of the standardized protocol, phallometric assessments will remain vulnerable to challenges, and will fail to provide a basis for comparing results across studies. A cross-centres standardization project is needed before phallometric evaluations can be considered fully legitimate. It is difficult to know what to say about the diverse and frequently contradictory nature of the findings when procedures differ so greatly across studies. Unless some agreement on standardization is reached, the responses of offenders may be idiosyncratically influenced in unknown ways by the procedures peculiar to each setting.

There is no doubt that the reliability of phallometric testing requires far more extensive examination than it has so far received. As for criterion validity, we need to develop strategies to reduce heterogeneity. Knight and Prentky’s (1990) classification scheme may serve as one approach, but most particularly we need more thorough evaluations of the differences between sadists and nonsadists, and perhaps high and low risk offenders.

This set of issues for future research does not, of course, exhaust the list of potential influences on phallometric data, but they do appear to represent the most pressing problems.


The present review indicates that those clinicians who rely on phallometrics must offer compelling arguments for doing so. The evidence on the reliability and validity of phallometrics presently available in the literature certainly offers little support for its use. Clinicians must, at the very least, provide clear evidence that their particular phallometric procedures are reliable and do validly discriminate offenders from nonoffenders. The value of phallometric assessments in the clinical evaluation of sexual offenders has not, on the basis of the present review, received the kind of support necessary to justify such use. Indeed, some may find justification in the present review for abandoning the use of phallometric assessments altogether.

When phallometric testing is used to evaluate sexual offenders we need to be clear about what these data tell us. It has no place in contributing to decisions about guilt or innocence (Marshall & Fernandez, 2000a) since some nonoffending males find forced sex or sex with children to be attractive (Finkelhor, 1979; Malamuth, 1986) and many sexual offenders respond normally at phallometric assessment (Marshall & Fernandez, 2000b). If deviant responses are evident at phallometric assessment, then treatment needs may have been identified, but it is likely that this would already be known on the basis of the client’s offence history. In fact, even if a sexual offender displays deviant responses at testing these responses may not need to be targets in treatment (see Marshall & Fernandez [2003b] for a full discussion of this issue). If, however, a client displays either normative responses, or fails to show interpretable levels of arousal, then the phallometric evaluations have not advanced our understanding of the client’s needs or problems. He may still have deviant interests, but has successfully hidden them, or his offending may not be driven by deviant sexual interests. Other assessment procedures will always be needed, whatever the results of phallometric assessments (McGovern, 1991).


The goal of this review was to examine the available evidence on phallometric testing and, on the basis of the review, to make recommendations about the use of such testing procedures. The evidence, in my opinion, does not justify the routine use of phallometry in the clinical evaluation of sexual offenders. Indeed, at present the evidence does not justify any use of phallometry unless the clinician can provide data showing that his/her specific evaluation procedure is reliable and valid. Even the use of phallometry to explore theoretical issues seems unjustified given the failure in the present review to find support for the psychometric bases of phallometry. Indeed, the findings of such research may be misleading and would certainly require several independent replications before confidence could be placed in the results. At present it is difficult to recommend any use of phallometry; researchers and clinicians are encouraged to seek alternative evaluation procedures.


Author address

William L. Marshall
Rockwood Psychological Services
Suite 403, 303 Bagot Street, Kingston
Ontario, K7K 5W7 Canada




