Cleaning your Zoom transcripts with Python

Alex Gregorie
2 min read · Oct 1, 2024


If you do UX research, you probably have a handful of recordings and transcripts sitting around. These are invaluable for capturing user quotes and ideas to take back to developers. However, working with Zoom’s transcripts can be messy. Between the line counters, timestamps, and sentences split across lines, it can be difficult to use these resources in a tool like Miro or Lucidspark.

To solve this problem, I looked around online and found a script by Mighty Minh (source). While it didn’t solve my problem, it gave me a good head start. A few modifications later, I had my own script that can:

  1. open a Zoom transcript
  2. remove the junk lines such as headers, line counters, and timestamps
  3. join the remaining lines together by speaker
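
To make the "junk lines" step concrete, here is a minimal sketch of the filtering idea. The sample transcript is invented (the names and dialogue are made up) and only imitates the general shape of a Zoom .vtt export; the helper name spoken_lines is also just for illustration. It uses the same two-word speaker pattern as the script further down.

```python
import re

# Invented sample in the shape of a Zoom .vtt export (names and text are made up).
SAMPLE = """WEBVTT

1
00:00:01.000 --> 00:00:04.000
Jane Doe: Thanks for joining today.

2
00:00:04.500 --> 00:00:07.000
Jane Doe: Can you walk me through your setup?

3
00:00:07.500 --> 00:00:10.000
Sam Lee: Sure, I start on the dashboard.
"""

# Same pattern the script uses: two words followed by a colon at the start of a line.
SPEAKER_LINE = re.compile(r"\w+\s\w+:")


def spoken_lines(text):
    """Return only the lines that start with a two-word speaker label."""
    return [line for line in text.splitlines() if SPEAKER_LINE.match(line)]


# The header, counters, timestamps, and blank lines are filtered out.
for line in spoken_lines(SAMPLE):
    print(line)
```

One thing to keep in mind with this pattern: a speaker label that is a single word or has three or more words would not match, so those lines would be dropped along with the junk.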

This transformation makes it much easier to grab pull quotes and insights from my customer interviews. I run it in Colab, as I haven’t yet been able to get the terminal arguments to work the way I want.

To use the latest version of this script, head over to my GitHub. If you want to suggest any changes, feel free!

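import re


def clean_transcript(file, newfile):
    # Read the raw Zoom transcript into a list of lines.
    with open(file, "r") as f:
        lines = f.readlines()

    # Matches a line that starts with a two-word speaker label, e.g. "Jane Doe:".
    # Headers, line counters, timestamps, and blank lines don't match and are dropped.
    nameregex = r"\w+\s\w+:"

    newlines = []
    speaker = ""
    for l in lines:
        if re.match(nameregex, l):
            # Split on the first colon only, so colons inside the spoken text survive.
            name, text = l.split(":", 1)
            if name != speaker:
                # New speaker: start a new paragraph and keep the label.
                speaker = name
                newlines.append("\n\n" + l)
            else:
                # Same speaker: keep only the text, minus the space after the colon.
                newlines.append(text[1:])

    transcript = "".join(newlines)
    # Replace a single newline with a space when it follows two characters from
    # [a-z., ]; the blank lines between speakers are left alone.
    text1 = re.sub(r'(?<=[a-z., ]{2})\n(?!\n)', ' ', transcript)

    with open(newfile, "w") as f:
        f.write(text1)


# Replace the names here to point to your transcript file and whatever you want to name your new file.
clean_transcript("mytranscript.transcript.vtt", "clean_file.txt")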
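
For running the script from a terminal instead of Colab, here is one possible sketch using Python's standard argparse module. The parse_args helper, the -o/--output flag, and the "_clean.txt" default name are all assumptions made for illustration; they are not part of the script above. Passing the argument list explicitly, as shown, sidesteps the fact that a notebook's sys.argv typically holds the kernel's own arguments rather than yours.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the input path and an optional output path (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Clean a Zoom transcript.")
    parser.add_argument("input", help="path to the Zoom transcript, e.g. a .vtt file")
    parser.add_argument("-o", "--output", help="path for the cleaned text file")
    args = parser.parse_args(argv)
    if args.output is None:
        # Assumed default: text before the first dot in the file name, plus "_clean.txt".
        src = Path(args.input)
        args.output = str(src.with_name(src.name.split(".")[0] + "_clean.txt"))
    return args


# Passing the list explicitly also works in a notebook.
args = parse_args(["mytranscript.transcript.vtt"])
print(args.input, "->", args.output)
# From a terminal you would call parse_args() with no list and then run:
# clean_transcript(args.input, args.output)
```

Saved together with the function above as, say, clean_zoom.py (a made-up file name), this could be run as: python clean_zoom.py mytranscript.transcript.vtt -o clean_file.txt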

Written by Alex Gregorie

A UX Designer in Atlanta focused on mentoring, modular UI and using python as a research method. www.alexgregorie.com
