We designed a framework for visual documentation that uses cystoscopy videos and screenshots taken during the procedure to document clinically significant lesions. Screenshots serve as the communication medium between the surgeon and the framework, which curates lesion-specific video content. Figure 1 summarizes the workflow of the framework for visual documentation of cystoscopic findings.
The study was approved by the Stanford University Institutional Review Board (Protocols #29838 and #36085), and informed consent was obtained from all study subjects. Eighty-six cystoscopy videos recorded during transurethral resection of bladder tumors (TURBT) performed at the Veterans Affairs Palo Alto Medical Center by a single surgeon (JCL) between 2016 and 2018 were reviewed. Each case had a full-length video and one or more screenshots with pathological correlation. This cohort was a subset of the dataset previously published by Shkolyar et al. [10].
Videos and screenshots were recorded and captured on an integrated cystoscopy system capable of both white light and blue light cystoscopy (KARL STORZ D Light C Photodynamic Diagnostic System, Tuttlingen, Germany). In some cases, multiple video files were generated during a single procedure due to surgeon preference or interruptions in video recording. A clinical research coordinator was responsible for maintaining the data repository and for associating the screenshots with the video recordings of each case. The visual content of the videos and the screenshots did not include protected health information (PHI).
Data management
Screenshots taken during cystoscopy functioned as the data exchange medium between the clinical expert and the framework. A standardized naming scheme was devised for the filenames of the screenshots to store the pseudonymized case identification, date, lesion location, lesion number, pathology, and image modality, as follows:
[case id][date]_[lesion location]_[lesion id]_[lesion description]_[image modality]
We utilized well-accepted ontologies to describe the entities in the urinary bladder [15–18]. The location was defined according to the anatomical scheme of the urinary bladder proposed by the European Association of Urology guideline for cystoscopy reporting [1]. The lesion identification (e.g., tumor1, tumor2) documents the unique lesions visible in the screenshots. The image modality was either white light cystoscopy or blue light cystoscopy, an enhanced photodynamic imaging modality frequently applied for better visualization of suspicious bladder lesions [19].
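As an illustration, a hypothetical filename following this scheme could be decomposed as follows; the filename and the modality abbreviations ("WL"/"BL") are examples of our own, not actual study data:

```python
# Illustrative parsing of a screenshot filename that follows the naming scheme.
# The filename and the modality codes ("WL"/"BL") are hypothetical examples.
filename = "case01220170315_leftlateralwall_tumor1_Ta_lg_BL.png"

stem = filename.rsplit(".", 1)[0]               # drop the file extension
case_and_date, location, lesion_id, *rest = stem.split("_")
image_modality = rest[-1]                       # e.g., "WL" or "BL"
lesion_description = "_".join(rest[:-1])        # e.g., "Ta_lg" (stage and grade)

print(case_and_date, location, lesion_id, lesion_description, image_modality)
```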
We defined a hierarchical data storage scheme for cystoscopy videos and screenshots, shown in Fig. 2, for systematized data access and scalability. Screenshots were stored in JPEG, BMP, or PNG format and videos were stored in MP4 format. Videos were recorded at either 320p or 480p resolution. The data are stored on Stanford Medicine Box, an institutional HIPAA-compliant cloud-based storage solution [20], and are accessible to all study team members via secure two-factor authentication.
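Purely as an illustration of such a case-centered hierarchy (the authoritative layout is the one shown in Fig. 2, and the folder names below are our assumptions), the structure could be created programmatically:

```python
from pathlib import Path

# Illustrative case-centered layout; the actual hierarchy follows Fig. 2,
# and the case identifiers and subfolder names here are hypothetical.
root = Path("cystoscopy_data")
for case_id in ["case001", "case002"]:
    for subfolder in ["videos", "screenshots", "clips"]:
        (root / case_id / subfolder).mkdir(parents=True, exist_ok=True)
```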
Keywords in the screenshot filenames were searched using a fault-tolerant exhaustive search algorithm written in Python to extract label information for QR barcode generation and downstream data processing. For demonstration purposes, we selected keywords that describe the lesions according to the frequency of the various entities, thereby covering the most common benign and malignant pathologies, as listed in Table 1.
Table 1
Keywords used to manually label screenshots in their file names. Stage and grade are defined according to the American Joint Committee on Cancer (AJCC)[21] and reported together in the filename.
Keyword | Description |
inflammation | Acute and chronic inflammation |
benign | Benign lesions, not specified |
dysplasia | Dysplasia |
| Bladder cancer stage |
Tis | Tis |
Ta | Ta |
T1 | T1 |
T2 | T2 |
| Grade |
lg | low grade |
lg_focal_hg | low-grade with focal high-grade |
hg | high grade |
AdenocarcinomaEntericType | Adenocarcinoma (Enteric Type) |
clot | Blood clot |
nobiopsy | Cystoscopic lesions with no corresponding biopsy, either because of low suspicion or because the lesion was fulgurated at the surgeon's discretion. |
FPBLC | False-positive blue light cystoscopy (positive blue light signal indicating suspicion for cancer, but the corresponding appearance under white light is clearly not suspicious). |
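A minimal sketch of such a keyword search over the screenshot filenames is shown below; the case-insensitive token matching is only one plausible form of fault tolerance and does not necessarily mirror the original implementation:

```python
import os

# Subset of the Table 1 keywords; matching is exhaustive over all keywords.
KEYWORDS = ["inflammation", "benign", "dysplasia", "Tis", "Ta", "T1", "T2",
            "lg_focal_hg", "lg", "hg", "clot", "nobiopsy", "FPBLC"]

def extract_labels(filename):
    """Return all Table 1 keywords found in a screenshot filename.

    Case-insensitive matching is used here as a simple form of fault
    tolerance against inconsistent capitalization. A production version
    would also need to resolve overlapping keywords (e.g., prefer
    'lg_focal_hg' over 'lg' and 'hg').
    """
    stem = os.path.splitext(filename)[0].lower()
    tokens = set(stem.split("_"))
    labels = []
    for kw in KEYWORDS:
        kw_l = kw.lower()
        # multi-part keywords are matched as substrings, single keywords
        # by exact token comparison to avoid spurious partial matches
        if ("_" in kw_l and kw_l in stem) or kw_l in tokens:
            labels.append(kw)
    return labels

print(extract_labels("case01220170315_leftlateralwall_tumor1_Ta_lg_BL.png"))
# -> ['Ta', 'lg']
```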
Content search algorithm
We utilized Mean-Squared Error (MSE) as a metric to measure the image similarity between screenshots and video frames [22, 23]:
$$\mathrm{MSE}=s:=f\left(R,F\right)=\frac{1}{whc}\sum _{i=0}^{w-1}\sum _{j=0}^{h-1}\sum _{k=0}^{c-1}{\left[R\left(i,j,k\right)-F\left(i,j,k\right)\right]}^{2}$$
1
where R is the matrix of a screenshot, F is the matrix of a frame image in a video, w is the image width, h the image height, and c the number of color channels. Since a single cystoscopy video can be associated with multiple screenshots, we generated a separate sequence of scores for each screenshot:
$${S}_{n}=\left\{{s}_{1},{s}_{2},{s}_{3},\dots ,{s}_{l}\right\}$$
2
where \({S}_{n}\) is the score collection for screenshot \(n\) over a frame sequence of length \(l\). This can also be written as \({S}_{n}= {\bigcup }_{r=1}^{l}f({R}_{n},{F}_{r})\), where \(r\) is the frame index. The lowest score \(x\) in \({S}_{n}\) identifies the frame most similar to the screenshot (the index frame) and was obtained using the following formula:
$${\mathrm{argmin}}_{x\in {S}_{n}}\,x:=\left\{x\in {S}_{n} \mid \forall s \in {S}_{n}: s\ge x\right\}$$
3
Next, a window spanning the frames from 3 seconds before to 7 seconds after the index frame (\({frame}_{argmin}\)) was defined to increase the likelihood of including different views of the lesion.
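For illustration, a minimal NumPy/OpenCV sketch of Equations 1–3 and the extraction window might look as follows; the function names, the resizing step, and the boundary clamping are our assumptions rather than a description of the production script:

```python
import cv2
import numpy as np

def mse(reference, frame):
    """Equation 1: mean squared error over width, height, and color channels."""
    r = reference.astype(np.float32) / 255.0   # normalize channels to [0, 1]
    f = frame.astype(np.float32) / 255.0
    return float(np.mean((r - f) ** 2))

def find_index_frame(video_path, screenshot):
    """Return the index of the frame with the lowest MSE against the screenshot."""
    cap = cv2.VideoCapture(video_path)
    scores = []                                # Equation 2: S_n = {s_1, ..., s_l}
    ok, frame = cap.read()
    while ok:
        if frame.shape != screenshot.shape:    # match resolutions before comparing
            frame = cv2.resize(frame, (screenshot.shape[1], screenshot.shape[0]))
        scores.append(mse(screenshot, frame))
        ok, frame = cap.read()
    cap.release()
    return int(np.argmin(scores)), scores      # Equation 3: lowest score in S_n

def clip_bounds(index_frame, fps, n_frames):
    """Window of 3 s before and 7 s after the index frame, clamped to the video."""
    start = max(0, index_frame - int(3 * fps))
    end = min(n_frames - 1, index_frame + int(7 * fps))
    return start, end
```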
Lesion content extraction
A Python script incorporating the search algorithm was written to extract, for every screenshot, the frame sequence containing the most similar frame and to generate a short video clip. After extraction of each video clip, a QR barcode containing the corresponding clinical and pathological information about the lesion, the video source, and the file source of the reference image was generated and appended to the video. This QR barcode facilitates review of the video and screenshot filenames to ensure that the information was extracted correctly for quality control. The resulting video clip was stored in MP4 format and named according to the filename of the screenshot.
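As an illustration, the barcode could be produced with the Python qrcode package and sized to the clip resolution with OpenCV; the payload fields below and the idea of writing the barcode as a trailing frame are assumptions for demonstration, not necessarily the exact implementation:

```python
import json
import cv2
import qrcode

def make_qr_frame(payload, qr_path, frame_size):
    """Render a JSON payload as a QR code image sized to the video frame."""
    qrcode.make(json.dumps(payload)).save(qr_path)   # write the QR image to disk
    qr = cv2.imread(qr_path)                          # read it back as a BGR array
    return cv2.resize(qr, frame_size)                 # frame_size = (width, height)

# Hypothetical payload; the field names are illustrative.
payload = {
    "pathology": "Ta, low grade",
    "video_source": "case01220170315.mp4",
    "reference_image": "case01220170315_leftlateralwall_tumor1_Ta_lg_BL.png",
}
qr_frame = make_qr_frame(payload, "qr_tmp.png", (640, 480))
# qr_frame can then be appended after the last clip frame via cv2.VideoWriter.
```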
MSE scores were calculated after normalizing all color channels (red, green, blue) to values between 0 and 1. Since the same lesion can appear at different frame positions in a cystoscopy video, we ran the search algorithm through the entire video and considered the frame indices with the lowest MSE scores for the index frame selection. To prevent the selection of frames adjacent to an index frame and to mitigate the likelihood of producing video clips with duplicated visual content, we defined a lockout interval of 3 seconds before and 10 seconds after each index frame. The script was configured to process eight cases in parallel on an Intel processor with 9 cores and 32 GB of RAM.
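The multi-occurrence selection with the lockout interval and the case-level parallelization could be sketched as follows; the number of candidate index frames per video and the overall structure are illustrative assumptions:

```python
from multiprocessing import Pool

import numpy as np

def select_index_frames(scores, fps, n_candidates=3):
    """Pick the lowest-MSE frames while enforcing a lockout interval of
    3 s before and 10 s after each selected index frame."""
    scores = np.asarray(scores, dtype=float)
    blocked = np.zeros(len(scores), dtype=bool)
    selected = []
    for _ in range(n_candidates):
        if blocked.all():
            break
        masked = np.where(blocked, np.inf, scores)
        idx = int(np.argmin(masked))
        selected.append(idx)
        blocked[max(0, idx - int(3 * fps)):idx + int(10 * fps) + 1] = True
    return selected

def process_case(case_id):
    """Placeholder for the per-case pipeline (search, clip extraction, QR barcode)."""
    ...

if __name__ == "__main__":
    case_ids = ["case001", "case002"]          # hypothetical identifiers
    with Pool(processes=8) as pool:            # eight cases processed concurrently
        pool.map(process_case, case_ids)
```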
Utility
Quality control
A web portal was designed to allow expert urology reviewers to securely assess the content quality of the 10-second video clips using two-factor authentication. For each video clip, the reviewer obtained information about the original screenshot using the QR barcode [24] and verified whether the lesion seen in the screenshot was present in the video clip. For quality control, the reviewer checked the video content for presentation of the lesion from different angles and for correct lesion selection. Finally, data completeness of the pathology information in the barcodes of all video clips was assessed.
Application
As an application of our framework, we examined the feasibility of creating a cystoscopy atlas by presenting the video clips arranged by pathological diagnosis. To protect PHI and preserve HIPAA compliance, the QR barcode containing the source file information was replaced with a QR barcode including only the pathology information.
Additionally, we examined the feasibility of using the video clips to store annotation information and to reveal the label distribution for each frame in these video clips. Each 10-second video clip was manually annotated to denote the range of frames meeting the description of the keywords considered in the current study. The frames were annotated according to the keywords described in Table 1 using a customized annotation tool (https://github.com/oeminaga/VideoAnnotator). Any frame that did not meet the description of any keyword was considered background. The annotation data were then stored in JSON format in the barcode of the video clips, in addition to the pathology information.
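For example, the annotation payload embedded in a clip's barcode could take a form similar to the following; the field names and frame ranges are purely illustrative:

```python
import json

# Illustrative annotation payload stored in the clip's QR barcode alongside
# the pathology information; keys and values are hypothetical examples.
annotation = {
    "pathology": "Ta, low grade",
    "frame_labels": [
        {"label": "background", "start_frame": 0, "end_frame": 44},
        {"label": "Ta", "start_frame": 45, "end_frame": 190},
        {"label": "background", "start_frame": 191, "end_frame": 299},
    ],
}
payload = json.dumps(annotation)   # serialized and embedded in the QR barcode
```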