Transcribing sessions

If you have not transcribed sessions before, just write or type out what was said, starting a new line for each speaker turn and marking it T (therapist) or C (client).

The key things to remember are to underline any stretch where the speakers talk over each other, and to put small “ums and aahs” in brackets rather than starting a new speaker turn.

Transcribing gets easier as you become familiar with it. If you want to learn the transcription conventions used in process research, there are lots of resources below.

Here is a useful summary:

Verbatim Transcription of Research Interviews and Focus Group Discussions

From: Verbatim Transcription of Research Interviews and Focus Group Discussions – Academic Transcription Services

Posted in: Interview Transcription, Transcription Mechanics.
Last Modified: September 13, 2021

What is Verbatim Transcription

Poland (1995) defines verbatim audio transcription as the word-for-word reproduction of verbal data, where the written words are an exact replication of the recorded (video or audio) words. Under this definition, accuracy concerns the substance of the interview, that is, the meanings and perceptions created and shared during a conversation, as well as how those meanings are created and shared. Verbatim transcription of research data therefore attempts to capture not only the meaning(s) and perception(s) of the recorded interviews and focus group discussions, but also the context in which these were created.

Why Verbatim Transcription

Whether or not one chooses to get verbatim transcripts for their qualitative data depends on the purposes of the research. Research methods should always reflect research questions. As an important step in data management and analysis, the process of transcription must be congruent with the methodological design and theoretical underpinnings of each investigation.

Verbatim transcripts attempt to capture a word-for-word reproduction of the recorded data. In addition, there are three distinct kinds of vocalization and non-verbal interaction that verbatim transcripts aim to capture: response/non-response tokens, involuntary vocalizations, and non-verbal interactions. Transcribing these features of speech can add context and clarity to the discussion or interview.

The mechanics of Verbatim Transcription

In creating verbatim transcripts, it’s important for the transcriber to provide an exact match between what is recorded and what is transcribed into text. The notion of an “exact match” is problematic given “the intersubjective nature of human communication, and transcription as an interpretive activity” (Poland, 1995, p. 297), so it’s important to pay attention to:

1) Response Tokens. The common ones are hm, ok, ah, mmh, yeah, um, and uh. These are intentional, and many researchers use them as verbal probes to elicit more information from the interviewee. Research has shown that such vocalizations can provide a great deal of insight into both the nature and the informational content of the conversation (Gardner, 2001). Listen to the different pitches and contours of the sound.

2) Involuntary Vocalizations. Sounds such as coughing, sneezing, burping, sniffing, laughing and crying are considered involuntary noises. Background noises, for instance a dog barking, sirens, or a phone ringing (which happens a lot), are also categorized as involuntary vocalizations. Involuntary sounds that occur during an interview can be meaningful or meaningless to the analyst.

3) Non-verbal Interactions. Non-verbal communication includes actions, activities, and interactions of both participant and interviewer. Gesticulations such as pointing, thought checking, fidgeting, head nodding and hand gestures are included as non-verbal interactions. Non-verbal interactions are mostly relevant when transcribing from a video. As with the other forms of noise, non-verbal interactions can add context and explanation, or create misunderstandings for the researcher.

In addition, it is important to pay attention to the pronunciation and irregular grammar that are essential parts of everyday speech. These offer important insights into a participant’s life and meaning-making that add richness that would otherwise be lost.

Here is a table summary of Verbatim Transcription Conventions.

Verbatim Transcription Conventions (Adapted from Tilley & Powick, 2002)

Sounds
  • Thinking before someone speaks: um, ah
  • “I’ve never thought of that before”: mmh [= ha, huh]
  • Affirmative sounds: yup [= yep], yeah [= yah, yea, ya]
  • Listening + encouragement: umm [= aha, uha, mmm]
  • Environmental sounds: [tapping], [knock at door], [shuffling papers]

Tone of speaker
  • Louder: CAPITAL LETTERS

Demonstrative expressions
  • Words spoken while laughing: [laughing]
  • Laughter when both parties are laughing at something: [laughter]
  • Other: [coughing], [sighing], etc.

Pauses
  • +5 seconds: [pause]

Interruptions
  • Use [inter.] where the break happens

Self-talk or repeating what someone else said
  • Use “quotes”

Repetition
  • Type out the repeated words, words, words

Punctuation
  • End of thought: a period (.) at the end of the complete idea
  • End of phrase / clause: use a comma (,)
  • Thought not completed: use an ellipsis (. . .) as the thought trails off

Cross-talk
  • Two or more speakers speaking at the same time / over each other: [CT]

Tape is unclear / muffled and can’t make out word or phrase of one speaker
  • [inaudible] [timestamp]


Tilley, S. A. and K. D. Powick (2002). Distanced Data: Transcribing Other People’s Research Tapes. Canadian Journal of Education, 27(2 & 3), 291–310.

Poland, B. D. (1995). Transcription quality as an aspect of rigor in qualitative research. Qualitative Inquiry, 1, 290–310.

Gardner, Rod. (2001). When Listeners Talk: Response Tokens and Listener Stance. John Benjamins Publishing Company.

More examples

Psychotherapy Verbatim Transcription Guide

Date: March 10, 2015 Author: Isaac

A while back I wrote a post about the general verbatim transcription conventions that we use.

A client asked us for verbatim transcription of a counseling session using the psychotherapy transcription standards by Mergenthaler and Stinson (1992).

The first step was to create a transcription guide that followed the standards published in 1992, then update the guide for 2021. Luckily, very little adaptation was needed, and the client had followed our advice (check out this post) and recorded high quality audio.

Here’s the guide that we created, closely following Mergenthaler and Stinson (1992), and hopefully it is of use to you.

Psychotherapy Verbatim Transcription Guide
What To Transcribe?

Verbal utterances: All words spoken as whole words or parts of words are to be reproduced in standard spelling. Dialect forms should be transcribed in their corresponding standard spelling forms. For example, if an English-speaking person’s usual speech sounds like the following:

P: I know she ain’t gonna gimme lotsa trouble.

it should be transcribed using standard English spelling as follows:

P: I know she ain’t going to give me lots of trouble.

Note that the word “ain’t,” although substandard, is retained in its standard dictionary spelling. For transcribing instances where a speaker deliberately uses dialect forms signaling emphasis or humor, see below.

Paraverbal Utterances. All sounds or sound sequences serving as conversational gap fillers, expressions of feelings of doubt, confirmation, insecurity, thoughtfulness, and so on in English are written in the following standard spelling whenever possible (modified from Dahl, 1979):

  • Affirmative: mm-hm, uh-huh, yeah, yup
  • Negation: huh-uh, nah, uh-uh, hm-mm
  • Noncommittal: hm, mm
  • Hesitations: ah, eh, em, er, oh, uh, um
  • Questioning: eh, huh, oh
  • Humor: ha, haha, ho, hoho
  • Exclamation: ach, aha, ahh, bang, boom, ech, kerbang, oh, ooh, oops, ow, pooh, pow, uch, ugh, wham, whew, whomp, whoo, whoops, whoosh, whop, wow

Additions to this list might be needed.

Nonverbal Utterances. All other noise-producing actions of the speaker are to be recorded where they occur in the text in the form of simple comments within parentheses:

P: (sneeze)(cough) well (sigh), I guess I caught a cold (laugh).

Noises Occurring in the Situational Context: Any other sounds produced by the situational environment are indicated within simple comments:

P: later when I (telephone rings): do you need to answer that?

Pauses. You may use a single dash character surrounded by spaces ( – ) to indicate a pause of approximately one second. Multiple dashes should be separated by spaces. Pauses of greater than approximately 5 seconds should not be indicated with dashes, but should be timed and indicated using the following coded comment form:

P: I can think of – –, nothing (p:00:03:35) nothing at all.

The example above indicates a pause of approximately 2 seconds and a second pause of 3 minutes and 35 seconds.
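If you want to total up pause time by computer, the pause rule can be sketched roughly as follows. This is only an illustration, not part of the published standards: the function name `pause_seconds` is made up, each free-standing dash is counted as one second, and hyphens, en dashes and em dashes are all accepted as the dash character because word processors often convert them.

```python
import re

# Timed pause comment such as (p:00:03:35), read as hours:minutes:seconds.
TIMED_PAUSE = re.compile(r"\(p:(\d{2}):(\d{2}):(\d{2})\)")
# A short pause is a dash standing on its own. Requiring whitespace (or the
# start of the text) before it excludes the hyphen that ends a word fragment
# such as "whenev-".
SHORT_PAUSE = re.compile(r"(?<!\S)[-\u2013\u2014](?=[\s,.;?!]|$)")


def pause_seconds(turn: str) -> int:
    """Return the approximate total pause time in one speaker turn, in seconds."""
    total = 0
    for hours, minutes, seconds in TIMED_PAUSE.findall(turn):
        total += int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    total += len(SHORT_PAUSE.findall(turn))  # assumed: one second per dash
    return total
```

Applied to the example turn above, this adds the two dashes to the 3 minute 35 second timed pause.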


Incomplete Words. Word particles generated by word breaks, including stuttering and stammering, are to be indicated by the word fragment followed by a hyphen (-) and a space. A broken word is defined as an incomplete word that is not repeated:

P: whenev- I can never visit them alone.

Stuttering is defined as: 1) one or more word particles, each sharing the initial letters of the following completed word; or 2) a sequence of more than one word particle, each particle sharing initial letters, but not followed by the completed word:

P: sh- sh- she t- t- t- asked me not to call her again

Indecipherable utterances. A single slash (/) is entered in the transcript for every utterance that cannot be clearly comprehended but can be distinguished as a separate word. A slash marking an incomprehensible word may be followed with a coded comment of the form “(?:word)” to indicate possibly correct words. Thus the “?:” indicates that the comment contains a word or words that may have been uttered by the speaker:

P: I was /(?:alone) there all /(?:night) / until he / / /.

If one cannot determine the number of words in an utterance or any of the possible words, this should be simply indicated with the following comment:

P: (incomprehensible)
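As a rough illustration of how the slash markers could be counted when checking audio quality, here is a minimal sketch. The function name `unclear_words` is an assumption made for this example. It treats a slash as an indecipherable word only when it stands at the start of a token, so a slash attached to a preceding word (used elsewhere in the guide for clarifications) is ignored.

```python
import re
from typing import List, Optional

# A slash at the start of a token, optionally followed by a "(?:guess)" comment.
UNCLEAR_WORD = re.compile(r"(?<!\S)/(?:\(\?:([^)]*)\))?")


def unclear_words(turn: str) -> List[Optional[str]]:
    """Return one entry per indecipherable word: the transcriber's guess, or None."""
    return [match.group(1) for match in UNCLEAR_WORD.finditer(turn)]
```

The length of the returned list gives the number of indecipherable words in the turn, and the non-empty entries are the candidate words to check against the recording.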

Quotations. If the speaker directly quotes prior discourse, the text for each speaker is enclosed in single forward quotation marks (‘), which is the same character as the apostrophe:

P: I asked ‘will you do it?’ and he yelled ‘stop talking to me like that’ and slammed the door.

Changes in Manner of Speaking. If the speaker changes his or her usual manner of speaking and uses a voice differing from the usual way of speaking, the words are enclosed between double quote characters (“). In such double quoted text, slang and literal transcription may be used.

P: she tells me not to say “yawl come back now” and “gimme that” what does she think this is grammar therapy?

Punctuation. Punctuation markers are used to help the reader reconstruct the original flow of speech. They are not used according to traditional grammatical rules, because normal speech is rarely so well-ordered. The transcriber should use punctuation marks to indicate changes in the way of speaking, emphasis, intonation, and cadence. When in doubt, punctuation marks should not be used. Punctuation markers are always placed at the end of a word and should not split a word. The following situations are differentiated:

  1. Completion of a thought. The period (.) indicates the end of a completed thought and is usually accompanied by a drop in pitch.
  2. Broken thought. The semicolon (;) indicates a broken thought followed by another thought, for example:

P: I hate the way you; did I tell you about the wedding?

  3. Hesitation. The comma (,) indicates a hesitation followed by a continuation of the same thought and is usually accompanied by a slight drop in pitch, for example:

P: you, never seem, to look at me when I am talking.

  4. Question. The question mark (?) indicates a question, usually accompanied by a clear rise in pitch. It should be used at the end of possible questions indicated by a rise in pitch even if the statement does not contain a clear grammatical question form:

T: Do you dislike it when he does that?

P: I should like! it when he does that?

  5. Emphasis. The exclamation mark (!) immediately follows words clearly emphasized by the speaker as in the prior and following examples:

P: that may not matter to him! but I do not! like it.

Note that the exclamation mark in transcription is used only for emphasis and does not indicate the end of a grammatical sentence.

  6. Lengthened pronunciation. The colon (:) is not used in its traditional grammatical way but is used to indicate protracted or extended pronunciation of a word as in the following example:

P: Well: I never really: liked that much anyway.


Transcript Heading. The transcript should contain a header. The following example shows the types of information that one may wish to include. The entire set of information should be enclosed in parentheses as a comment:

(SUBJECT ID: 105, SESSION NO: 32, DATE: 29.SEP.1986, THERAPIST: Dr. Smith, TEXT TYPE: psychoanalytic session, Version No: 1.0)

Speaker Codes. Each turn of speech begins on a new line and is preceded by a code indicating the speaker. Speaker codes are of the format Xn:, wherein X is a single letter indicating the speaker’s role and n is an optional digit (if there is more than one speaker of a certain role). If n is omitted, it is assumed to be the digit 1. Thus, in the following example:

T: how did that make you feel?

P1: I felt confused and angry.

P2: you never told me you were angry about that.

the first speaker T is a therapist and P1 and P2 are two patients. The speaker code T has an implicit digit component of 1 and is therefore the same as T1. This format can handle monologue, dialogue, individual therapy, group therapy, and single therapists or cotherapists.

A comment after the transcript header can be used to clarify the role of speakers, for example:

(P1 = Son, P2 = Mother, P3 = Father, T1 = Therapist, T2 = Cotherapist)
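For readers who process transcripts by computer, the speaker code format (one role letter, an optional digit, then a colon) can be parsed along these lines. This is a minimal sketch rather than a reference implementation; the names `Turn` and `parse_turn` are invented for the example, and role letters are assumed to be upper-case.

```python
import re
from typing import NamedTuple, Optional

# One upper-case role letter, an optional single digit, a colon, then the text.
SPEAKER_LINE = re.compile(r"^([A-Z])(\d)?:\s*(.*)$")


class Turn(NamedTuple):
    role: str    # e.g. "T" for therapist, "P" for patient
    number: int  # speaker number within the role; defaults to 1
    text: str    # the transcribed speech for this turn


def parse_turn(line: str) -> Optional[Turn]:
    """Split one transcript line into role, speaker number and text.

    Returns None for lines that do not start with a speaker code,
    such as header or role comments in parentheses.
    """
    match = SPEAKER_LINE.match(line.strip())
    if match is None:
        return None
    role, digit, text = match.groups()
    return Turn(role, int(digit) if digit else 1, text)
```

Because a missing digit is read as 1, a line coded T: and a line coded T1: are attributed to the same speaker.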

Capitalization. With the exception of proper or personal names or the first person pronoun “I,” all words including the first letter of a sentence begin with lower-case letters. This enables the use of even the simplest word-counting programs.

Simultaneity: Simultaneous speech presents special problems, both for comprehension and for representation of text. For two speakers, however, this can be easily handled by inserting a plus sign (+) at the start of simultaneous speech and continuing transcription of the initial speaker until simultaneity ends. This is followed by the entire simultaneous speech of the second speaker and terminated by another “+”. The remainder of the non-simultaneous speech is to be transcribed in its natural order. In the following example, the words “refused again” and “yes you” were spoken at the same time:

P: I was going to give John the map but he +refused again

T: yes you + have told me this once before
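To make the plus-sign rule concrete, here is a small sketch that pulls the overlapping words out of a pair of consecutive turns. The function name `overlapping_speech` is hypothetical, and the sketch assumes exactly two speakers and one overlap per pair of turns. Elapsed-time comments also contain a plus sign, so they are removed first.

```python
import re
from typing import Tuple

# Elapsed-time comments such as (+:00:01:00) also contain "+", so strip them.
ELAPSED = re.compile(r"\(\s*\+\s*:[\d:\s]*\)")


def overlapping_speech(first_turn: str, second_turn: str) -> Tuple[str, str]:
    """Return the words each speaker said during the overlap.

    In the first turn the overlap runs from the "+" to the end of the turn.
    In the second turn it runs from the speaker code to the closing "+".
    """
    first_turn = ELAPSED.sub("", first_turn)
    second_turn = ELAPSED.sub("", second_turn)
    _, _, first_overlap = first_turn.partition("+")
    _, _, second_body = second_turn.partition(":")
    second_overlap, _, _ = second_body.partition("+")
    return first_overlap.strip(), second_overlap.strip()
```

Called on the two example turns above, it returns the pair of phrases that were spoken at the same time.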

Compound Words. Compound words with standard hyphenated spellings are connected by hyphens without spaces:

P: I found the picture taped upside-down on the wall with a band-aid

Neologisms. Neologisms are spelled as best as possible. Words that are created by stringing other words together should be represented with hyphenation:

P: all this gaming-it-out is confusing me.

Word Division at the End of a Line. If the text is for computer-aided text analysis, words should not be split at margins using hyphens (this creates problems for some computer-aided text analysis tools); the word should be typed in full on the next line.

Contractions. The apostrophe (‘) should be used to indicate contractions:

P: it’s not fair that they’d get to go and I wouldn’t

Text analytic systems can then treat the two parts separated by the apostrophe as separate words (e.g., wouldn’t becomes wouldn, which can be treated as would, and t, which can be treated as not). If a contraction produces ambiguous parts, either the words should be spelled out completely or else the ambiguous parts should be followed with a slash and the clarifying word (or words connected by a hyphen without spaces as described above) as in the following example:

P: he’d/had not done it and he’d/would never do it

In the first case d stands for had and in the second case d stands for would. Without the additional information following the slashes the two d’s would be processed as the same word. If ’d is not clarified, it should be assumed to represent the word would. Do not use the apostrophe to indicate aphesis (the omission of letters at the beginning or end of a word). The word ’cause, for example, should be spelled out in its standard English form because. Do not use the apostrophe to indicate the possessive case. Instead of such forms as Mary’s and John’s one should transcribe as follows:

P: that coat is Marys and this one is Johns.
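The point of the contraction rule is easier to see with a toy tokenizer. The sketch below is an assumption about how a text analysis tool might behave, not a description of any particular program: it splits each contraction at the apostrophe and, where a slash clarification follows, uses the clarifying word in place of the ambiguous part. The name `split_contractions` is invented, and the input is assumed to be the text of a turn without quotation marks or speaker code.

```python
from typing import List


def split_contractions(turn_text: str) -> List[str]:
    """Split a turn into word tokens, treating contraction parts as separate words."""
    tokens = []
    # Treat the typographic apostrophe the same as the plain one.
    for raw in turn_text.replace("\u2019", "'").split():
        # Drop trailing punctuation, then separate any "/clarification".
        word, _, clarification = raw.strip(".,;?!:").partition("/")
        stem, apostrophe, suffix = word.partition("'")
        if stem:
            tokens.append(stem)
        if apostrophe and (clarification or suffix):
            # Prefer the clarifying word (had, would) over the bare part (d).
            tokens.append(clarification or suffix)
    return tokens
```

With the clarifications in place, the two occurrences of d in the example come out as different words, which is what lets a word count tell them apart.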

Plurals. The apostrophe should not be used to indicate plurals of letters, numbers, acronyms, or abbreviations. The underscore can be used for clarity, if necessary:

P: he always got As because he was the teachers pet

P: she only types lower case a_s because her typewriter is broken.

Abbreviations. With the exception of formal titles, abbreviations are not to be used unless the speaker verbally spells one. Periods are not used in abbreviations; use a space instead:

P: Mrs Smith thinks I made a terrible mistake, for example

T: mm-hm.

P: and it irritates me that Jane always says “e g”.

Numbers, Fractions, and the Like. Numbers and fractions are written out in full where possible. Only typical figures such as dates are transcribed as numbers. The abbreviations for “ante meridiem” and “post meridiem” should be capital letters without spaces (AM and PM):

P: in 1981 I saw the first two-thirds of a James_Bond_007 film at eleven-thirty PM for two dollars and fifty cents.

Mistakes. Slips of the tongue and other mistakes are to be transcribed in full:

P: I couldn’t stand the guilt, uh quilt she gave me for my birthday.

Correct Spelling. Spelling should follow Webster’s standards.

Where several marking rules apply, it is necessary to include them all in sequence, with a period or question mark going last:

P: he screamed ‘don’t shoot until you see the whites of their eyes’!.

Some Things To Avoid. Do not use a sequence of periods (…) to indicate ellipsis. Do not use special characters (such as { } ) unless needed for special purposes of your own.


The following set of rules can be of help for research settings with special needs.

Names. If confidentiality is an issue, pseudonyms may replace personal names, names of places and other identifiers. To signify that a name has been changed, precede it with an asterisk (*) without an intervening space. It is proposed that a separate list of substituted words be maintained and used consistently throughout all material transcribed for the same speakers:

P: *Jane told *Fred all about *Elliot and *Mary.

If more than one word is needed to replace a single word, the multiple substituted words should be joined by underscore characters (_) without intervening spaces. This enables the entire substitution to be counted as a single word in the case of subsequent computer text analysis:

P: *Albert changed his name and moved to *small_southwest_town.

If a title is to be used before a name, it should be separated from the name with a space. Apostrophes should be omitted from names containing them; hyphenated names should retain the hyphens. Names (even those not substituted by pseudonyms) should be joined with underscores to form a single entity:

P: Mr *Arnold_O_Malley wants to be on Hollywood_Squares and meet Eva_Gabor.
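If you keep the substitution list in a spreadsheet or file, applying it consistently can be automated. The following is a minimal sketch under stated assumptions: the function name `pseudonymize` and the sample names are hypothetical, the list maps each real identifier to its replacement, and a pseudonym must not itself contain one of the real names.

```python
import re
from typing import Dict


def pseudonymize(text: str, substitutions: Dict[str, str]) -> str:
    """Replace real identifiers with asterisk-marked pseudonyms.

    Multi-word pseudonyms are joined with underscores so that each
    substitution counts as a single word in later text analysis.
    """
    # Replace longer names first so "Mary Ann" is handled before "Mary".
    for real in sorted(substitutions, key=len, reverse=True):
        marked = "*" + substitutions[real].replace(" ", "_")
        pattern = r"\b" + re.escape(real) + r"\b"
        text = re.sub(pattern, lambda _match, m=marked: m, text)
    return text
```

For example, with a list that maps "Susan" to "Jane" and "Santa Fe" to "small southwest town", the sentence "Susan moved to Santa Fe." becomes "*Jane moved to *small_southwest_town."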

Date and Time Coding. The date, time of day, and elapsed times of a transcript may be inserted using special coded comments.

  1. Session date: The session date is indicated with a coded comment of the following form:

(d:06.MAR.1986)

The d: indicates the comment is a session date. The date is entered in the format “DD.MON.YEAR” (a two-digit representation of the day of the month, a three-capital-letter abbreviation of the month, and a four-digit representation of the year, separated by periods without spaces). Thus “(d:06.MAR.1986)” represents “March 6, 1986”. The session date should be placed at the top of the session transcript just after the heading (note that the form of this code makes it accessible to computer systems). If the exact date is unavailable, the unknown information should be replaced by zeros.

  2. Time of Day: The beginning of session time is indicated with a coded comment of the following form:

(t:10:02:15)

The t: indicates the comment is the actual time of day of the session, if available. All time codes are in the format “HH:MM:SS” (two-digit representations of hour, minute, and second, each separated by a colon). (Some facilities may also allow the notation of video frames, in which case the time codes would be in the format “HH:MM:SS:FF”; if this is used, it should be clearly indicated in a simple comment at the beginning of the transcript.) Thus “10:02:15” represents 2 minutes and 15 seconds after the hour of 10 o’clock. It is preferable to use 24-hour clock time. The session time should be placed at the start of the session transcript on the line following the session date. If the exact time is unavailable, the unknown information should be replaced by zeros.

  3. Elapsed Time. It is often helpful to insert elapsed time codes in a transcript. The relative time within a session is indicated with a coded comment of the following form:

P: we saw the movies (+:00:03:00) after dinner.

The “+:” indicates the comment contains the elapsed time since the beginning of the session. The “00:03:00” indicates this is the start of the third minute following the beginning of the session. If the minute changes in the middle of a word, the time code should be placed before that word. The interval between relative time codes (if they are to be used at all) depends on the nature of the study. For example, these codes can be used to relate the text to other temporally ordered data (e.g. physiological recordings). These might be placed at the beginning and end of specific events or they might be placed at regular intervals, such as every whole minute or every 5 minutes.
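Since these codes are meant to be machine-readable, here is a minimal sketch of reading the session date and the elapsed-time codes. The function names are made up for this example, English month abbreviations are assumed, and a date with zero-filled (unknown) parts is simply reported as None. The elapsed-time pattern tolerates stray spaces inside the parentheses, which sometimes creep in when transcripts are scanned or pasted.

```python
import re
from datetime import date, timedelta
from typing import List, Optional

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]  # assumed English abbreviations
SESSION_DATE = re.compile(r"\(d:(\d{2})\.([A-Z]{3})\.(\d{4})\)")
ELAPSED = re.compile(r"\(\s*\+\s*:\s*(\d{2}):(\d{2}):(\d{2})\s*\)")


def parse_session_date(comment: str) -> Optional[date]:
    """Read a (d:DD.MON.YEAR) comment; return None if any part is unknown."""
    match = SESSION_DATE.fullmatch(comment.strip())
    if match is None or match.group(2) not in MONTHS:
        return None
    day = int(match.group(1))
    month = MONTHS.index(match.group(2)) + 1
    year = int(match.group(3))
    try:
        return date(year, month, day)
    except ValueError:  # zero-filled or otherwise impossible date
        return None


def elapsed_codes(turn: str) -> List[timedelta]:
    """Return every elapsed-time code in a turn as a time offset from session start."""
    return [
        timedelta(hours=int(h), minutes=int(m), seconds=int(s))
        for h, m, s in ELAPSED.findall(turn)
    ]
```

The offsets returned by `elapsed_codes` can then be lined up against other time-stamped data, such as physiological recordings.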

Ambiguity. Some statements may be ambiguous in print yet unambiguous when heard in a sound recording. It is to the advantage of both computer-aided analysis and human readers to convert such ambiguous utterances into unambiguous ones. A clarifying alternative word may be placed behind a slash (/). Alternatively, a number placed immediately after the slash can be used to indicate the index number of a word’s meaning in a specific content-analytic dictionary. In the case of ambiguous pronouns, it is possible to name the antecedent behind the slash or to include several words connected by hyphens (this rule is primarily for use during the verification and scientific annotation phases of transcript preparation):

P: we/group thought he/James_Joyce had ignored it/rules-of-the-game.

Segment Demarcation. Various segmentations of the transcript may be accomplished by using coded comment structures to indicate the start of a segment “(s:CODE)” and the end of a segment “(e:CODE)”. These two are used to bracket a segment of the type indicated by “CODE,” e.g., “DREAM.” Whatever word is substituted for “CODE” must be spelled exactly the same in both start-segment and end-segment coded comments. It is permissible for different segment types to overlap or embed. This approach can be used for many types of segmentation. One might segment by relationship episodes and dreams, as in the following example:

P: (s:RE) when I told *Jane that last night I dreamed (s:DREAM) I was a butterfly (e:DREAM) she laughed (e:RE).

The coded comments “(s:RE)” and “(e:RE)” indicate the beginning and end of a relationship episode, respectively. The coded comments “(s:DREAM)” and “(e:DREAM)” indicate the beginning and end of a dream description. Segment demarcations of the same types must not overlap or be embedded. This will not usually be a problem.
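To show how the start and end codes might be used downstream, here is a rough sketch that collects the text of each segment and enforces the rule that segments of the same type must not nest or overlap. The function name `extract_segments` is an assumption for this example. Different segment types may overlap or embed freely, so text inside an inner segment also belongs to any outer segment that is still open.

```python
import re
from typing import Dict, List, Tuple

# Matches (s:CODE) and (e:CODE); tolerates a stray space after the bracket.
MARKER = re.compile(r"\(\s*([se]):(\w+)\)")


def extract_segments(text: str) -> List[Tuple[str, str]]:
    """Return (code, segment text) pairs in the order the segments end."""
    # split() with two capture groups yields:
    # [text, kind, code, text, kind, code, text, ...]
    parts = MARKER.split(text)
    open_segments: Dict[str, List[str]] = {}
    finished: List[Tuple[str, str]] = []
    for i in range(1, len(parts), 3):
        kind, code, following = parts[i], parts[i + 1], parts[i + 2]
        if kind == "s":
            if code in open_segments:
                raise ValueError("segment %s opened twice" % code)
            open_segments[code] = []
        else:
            if code not in open_segments:
                raise ValueError("segment %s closed but never opened" % code)
            pieces = open_segments.pop(code)
            finished.append((code, " ".join("".join(pieces).split())))
        # Text after this marker belongs to every segment still open.
        for pieces in open_segments.values():
            pieces.append(following)
    if open_segments:
        raise ValueError("unclosed segments: %s" % sorted(open_segments))
    return finished
```

Run on a turn with a dream description embedded in a relationship episode, it returns the dream text first (it ends first) and then the full episode, which includes the dream.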


I have included the following mock interview formatted according to the transcription standards described above.

Several examples of problems typically encountered in preparation of psychotherapy transcripts for research and education purposes are shown:




(T = Dr. Jones, P = John Doe)

T: can you, do you recall similar episodes in your /(?:adolescence) of / /? (microphone drops) you um, mentioned that uh an important person for you during your adolescence was your school teacher Miss *Green.

P: mm-hm.

T: is there a particular incident that stands out in your mind or (+:00:01:00) interaction between the two of you that stands out +in your mind that

P: um I think that I can + credit her for – – sort of turning me around uh academically, you know, because I was pretty much of a, I would not work real hard. and I think that um there was always a recognition that I had some potential um to do well in school, but never did well.

T: (+:00:02:00) mm-hm.

P: and uh my brother was absolutely brilliant and everybody loved him and he was valedictorian, and so I had to sort of uh come in his shadow through grammar school and high school, you know, because we went to the same schools. so this Miss *Green, she’d/had been his teacher. I remember her very clearly, I can picture her face very clearly. um and she decided, I guess, that I was not going to slide anymore. like t- to fool around a lot, you know.

T: mm-hm.

P: uh passing notes, goofing off, uh doing (+:00:03:00) things, I mean, not serious. but Miss *Green, she knew all this, I’m sure. well, it came grade time for the first marking period, and I knew enough never to get a C. and this one marking period she gave me two Ds!, uh (+:00:04:00) one D in math um

T: hm.

P: so she gave me a D, and I was just terrified. um she handed the report card t- to me and looked at me um for a long time. I remember that, and she made the D in red, in uh really thick: red: letters!.

T: (incomprehensible)

P: and uh so I was just (+:00:05:00) really flabbergasted. I didn’t know what I was going to do, because I knew to get a s- D was just awful.

T: mm-hm. do you remember what you were feeling a- a- at the time?

P: uh oh this really sinking feeling. like (child-like voice) “oh no”.

T: mm-hm.

P: I had really, really screwed up and didn’t know how I was going to get out of it. (p:00:00:40) (sigh) (+:00:06:00) uh it just seemed like the worst thing in the world that could happen. um and I knew my parents were going to be upset. and I knew that um it would be hard to undo that, you know. so I was feeling really, really um now I was afraid, um I was really dejected by it, I mean, that uh that she had done this. because I, actually m-, I should mention, I didn’t feel I deserved a D. – – – I think she gave it to me to motivate me.

T: mm-hm. do you remember; what do you think, was going on in her (+:00:07:00) mind at the ti-.

P: yeah, I r- I remember, because I said she looked at me for a long time when she handed my report card to me. her saying um, like around, it was after the report card that um that she said ‘you: are not! going to get away with doing no work’ and and that I was really going to have to do well.

T: mm-hm.

P: uh to get grades in her classroom. and she reited, reiterated that again. and at the (+:00:08:00) time I thought she was just the most stern, unreasonable person um, I mean, I do recall um really after that um not liking her and uh. so I do think that she, but uh but see I think what she was doing was something really um very caring and very positive. I mean she had singled me out, I think, or maybe she did other people too, to really just get them on the ball.


Mergenthaler, E. and Stinson, C. H. (1992). Psychotherapy transcription standards. Psychotherapy Research, 2(2), 125–142.

Dahl, H. (1979). Word Frequencies of Spoken American English. Essex, CT: Verbatim.

More examples

If you need the full transcription standards you can find them here

Erhard Mergenthaler & Charles Stinson (1992) Psychotherapy Transcription Standards, Psychotherapy Research, 2:2, 125-142