An Efficient Framework for Video Documentation of Bladder Lesions for Cystoscopy: A Proof-of-Concept Study

Processing full-length cystoscopy videos is challenging for documentation and research purposes. We therefore designed a surgeon-guided framework that extracts short video clips of bladder lesions for more efficient content navigation and extraction. Screenshots of bladder lesions were captured during transurethral resection of bladder tumor and then manually labeled with case identification, date, lesion location, imaging modality, and pathology. The framework used each screenshot to search for and extract a corresponding 10-second video clip. Each video clip included a one-second placeholder with a QR barcode describing the video content. The success of the framework was measured by the secondary use of these short clips and by the reduction in storage volume required for video materials. From 86 cases, the framework generated 249 video clips from 230 screenshots; 14 erroneous video clips derived from 8 screenshots were excluded. The HIPAA-compliant barcodes provided information on video contents with 100% data completeness. A web-based educational gallery was curated with various diagnostic categories and annotated frame sequences. Compared with the unedited videos, the informative short video clips reduced the storage volume by 99.5%. In conclusion, our framework expedites the surgeon-instructed generation of visual content for cystoscopy and enables the incorporation of video data into applications including clinical documentation, education, and research.


Introduction
Cystoscopy is an endoscopic procedure that visualizes the interior of the urinary bladder to facilitate diagnosis and intervention [1] and is one of the most common procedures in urology [2]. Clinical documentation of cystoscopy is traditionally limited to a written description of cystoscopic findings along with pertinent clinical history and plays a vital role in optimizing patient care [3].
Contemporary cystoscopic equipment has image and video acquisition capability that can supplement the written cystoscopy report. Taking screenshots of suspicious lesions during cystoscopy is commonly performed to facilitate longitudinal surveillance, assess treatment response, and inform patients of relevant findings. Storage of cystoscopy images and videos, however, is uncommon due to the lack of efficient infrastructure for storing and retrieving large video files. In addition to visual documentation, capturing cystoscopy images and videos could be useful for the curation of educational cystoscopy atlases [4,5] and the development of computer-aided solutions for augmented endoscopic imaging [6][7][8][9][10][11]. Furthermore, visual data is a fundamental resource for developing artificial intelligence-based solutions in medical imaging [12].
The current practice for visual documentation of cystoscopy typically involves recording the entire procedure and taking still screenshots of concerning lesions as directed by the surgeon. However, data management of visual content in the clinical setting is largely unstructured and disorganized. Particularly for full-length videos, the need for high-volume data storage and for time-intensive searches to find and extract the relevant frame sequences leads to inefficient data management and a significant amount of information waste. Although images are easier to store and navigate, they are static and lack spatiotemporal information. Poor data integrity of clinicopathological information and annotations is another common issue in the visual documentation of cystoscopy that limits the secondary use of cystoscopic materials for educational and research purposes [13,14].
Given the limitations to visual documentation of cystoscopy, we have developed a framework for provider-guided visual documentation. Our framework preserves spatiotemporal information and improves data integrity by linking pathological or annotation information with the corresponding visual contents in an efficient and decentralized manner.

Material and methods
We designed a framework for visual documentation that utilizes cystoscopy videos and screenshots taken during the procedure to document clinically significant lesions. Screenshots function as the communication medium between the surgeon and the framework, which curates lesion-specific video content. Figure 1 summarizes the workflow of the framework for visual documentation of cystoscopic findings.
The study was approved by the Stanford University Institutional Review Board (Protocol #29,838 and #36,085) and informed consent was obtained from all study subjects. Eighty-six cystoscopy videos recorded during transurethral resection of bladder tumors (TURBT) performed at the Veterans Affairs Palo Alto Medical Center by a single surgeon (JCL) between 2016 and 2018 were reviewed. Each case had a full-length video and one or more screenshots with pathological correlation. This cohort was a subset of the dataset previously published by Shkolyar et al. [10].
Videos and screenshots were recorded and captured on an integrated cystoscopy system capable of both white light and blue light cystoscopy (KARL STORZ D Light C Photodynamic Diagnostic System, Tuttlingen, Germany). In some cases, multiple video files were generated in a single procedure due to surgeon preference or interruptions in video recording. A clinical research coordinator was responsible for the data repository and for associating the screenshots with the video records for each case. The visual content of the videos and the screenshots did not include protected health information (PHI).

Fig. 1 Workflow of the framework for visual documentation of cystoscopy. A A screenshot of a lesion of interest taken during cystoscopy. After the procedure, the full-length cystoscopy video is saved, and the screenshot file is labeled with the corresponding location of the lesion in the bladder. Once the pathological diagnosis is available, the information is added to the screenshot filename. B The search algorithm is applied to the videos to generate short video clips containing the identified lesion by estimating the mean squared error (MSE). The MSE measures the similarity between the corresponding screenshot and the video frames. We then extracted the frame sequence from 3 s before to 7 s after the frame with the lowest MSE, i.e., the frame most similar to the corresponding screenshot. Finally, a one-second placeholder containing a QR code is added to the beginning of each video clip. The QR code contains pathologic information, frame annotations, and the screenshot and video sources of the video clip for quality control

Data management
Screenshots taken during cystoscopy functioned as the data exchange medium between the clinical expertise and the framework. A standardized naming scheme was devised for the filenames of the screenshots to store information on pseudonymized case identification, date, pathology, lesion location, and lesion number. We utilized well-accepted ontologies to describe the entities in the urinary bladder [15][16][17][18]. The location was defined according to the anatomical scheme of the urinary bladder proposed by the European Association of Urology guideline for cystoscopy reporting [1]. The lesion identification (e.g., tumor1, tumor2) documents the unique lesions visible in the screenshots. The imaging modality was either white light cystoscopy or blue light cystoscopy, an enhanced photodynamic imaging modality frequently applied for better visualization of suspicious bladder lesions [19].
For demonstration purposes, we selected keywords that describe the lesions according to the frequency of the various entities, thereby covering the most common benign and cancerous pathologies, as provided in Table 1.
We defined a hierarchical data storage structure for cystoscopy videos and screenshots, shown in Fig. 2, for systematized data access and scalability. Screenshots were stored in JPEG, BMP, or PNG format, and videos were stored in MP4 format. Videos were recorded at either 320p or 480p resolution.
The data are stored on Stanford Medicine Box, an institutional HIPAA-compliant cloud-based storage solution [21], and are accessible to all study team members using secure two-factor authentication.
Using a fault-tolerant exhaustive search algorithm written in Python, keywords in the screenshot filenames were searched to extract label information for QR barcode generation and downstream data processing.
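The filename-based labeling and keyword extraction can be sketched as follows. The exact field order and delimiter are not specified in the text, so the scheme below (underscore-separated fields ordered case ID, date, location, modality, pathology, lesion ID) is an illustrative assumption, as are the field names:

```python
import re

# Hypothetical filename layout following the naming scheme described above:
# <caseID>_<date>_<location>_<modality>_<pathology>_<lesionID>.<ext>
FIELDS = ["case_id", "date", "location", "modality", "pathology", "lesion_id"]

def parse_screenshot_name(filename):
    """Split a screenshot filename into its labeled fields.

    Returns a dict mapping field names to values, or None when the
    name does not match the expected number of fields (fault
    tolerance: malformed names are flagged for manual review).
    """
    stem = re.sub(r"\.(jpe?g|bmp|png)$", "", filename, flags=re.IGNORECASE)
    parts = stem.split("_")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

label = parse_screenshot_name("case042_20170315_dome_BL_TaLG_tumor1.png")
```

The parsed dictionary can then feed both the QR barcode payload and the downstream search.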

Content search algorithm
We utilized the mean squared error (MSE) as a metric to measure the image similarity between screenshots and video frames [22,23]:

$$\mathrm{MSE}(R, F) = \frac{1}{w\,h\,c}\sum_{i=1}^{w}\sum_{j=1}^{h}\sum_{k=1}^{c}\left(R_{ijk} - F_{ijk}\right)^{2} \quad (1)$$

where R is the matrix of a screenshot, F is the matrix of a frame image in a video, w is the image width, h the image height, and c the number of channels. Since cases can contain multiple screenshots from a single cystoscopy video, we generated a sequence of scores for each screenshot:

$$S_n = \left(s_1, s_2, s_3, \ldots, s_l\right) \quad (2)$$

where S_n is the score collection for a screenshot n over a frame sequence of length l.

Table 1 Keywords used to manually label screenshots in their filenames. Stage and grade are defined according to the American Joint Committee on Cancer (AJCC) [20] and reported together in the filename

where r is the frame index, i.e., s_r is the MSE between the screenshot and frame r. The index frame with the lowest score, x, is the frame most similar to the screenshot and was identified using formula (3). Next, a window spanning the frames from 3 s before to 7 s after the index frame was defined to increase the likelihood of including different views of the lesion.
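A minimal sketch of this similarity search, with the screenshot and each frame represented as flat sequences of channel values normalized to [0, 1] (a real implementation would operate on decoded video frames, e.g. via OpenCV, which is not shown here):

```python
def mse(ref, frame):
    """Mean squared error between two equal-length pixel sequences
    (channel values normalized to [0, 1]), as in Eq. (1)."""
    assert len(ref) == len(frame)
    return sum((r - f) ** 2 for r, f in zip(ref, frame)) / len(ref)

def best_frame_index(screenshot, frames):
    """Return the index of the frame most similar to the screenshot,
    i.e. the argmin over the score sequence S_n = (s_1, ..., s_l)."""
    scores = [mse(screenshot, f) for f in frames]
    return min(range(len(scores)), key=scores.__getitem__)
```

The 10-second clip is then the frame window [argmin − 3 s, argmin + 7 s] around this index.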

Lesion content extraction
A Python script incorporating the algorithm was written to extract the frame sequences that include the frame most similar to each screenshot, generating short video clips for all screenshots. After extraction of a video clip, a QR barcode containing the corresponding clinical and pathological information about the lesion, the video source, and the file source of the reference image was generated and added to the video. This QR barcode facilitates review of the video and screenshot filenames for quality control, ensuring that the information was extracted correctly. The resultant video clip was stored in MP4 format and named according to the filename of the screenshot.
$$\operatorname*{argmin}_{x \in S_n} := \left\{\, x \in S_n : \forall s \in S_n,\ s \ge x \,\right\} \quad (3)$$

MSE scores were calculated after normalizing all color channels (red, green, blue) to the range [0, 1]. Since the same lesion can appear at different frame positions in a cystoscopy video, we ran the search algorithm through the entire video and considered the frame indices with the lowest MSE scores for index frame selection. To prevent the selection of frames adjacent to the index frame and to mitigate the likelihood of producing video clips with duplicated visual content, we defined a lockout interval of 3 s before and 10 s after the index frame. The script was configured to process 8 cases simultaneously using parallel computing on an Intel processor with 9 cores and 32 GB RAM.
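The index-frame selection with the lockout interval can be sketched as below. The stopping rule (a hypothetical `threshold` on the MSE score) is an assumption, since the text does not state how additional windows per screenshot were accepted; with `threshold=None` only the single best window is returned:

```python
def select_clip_windows(scores, fps=30, threshold=None,
                        before_s=3, after_s=7,
                        lock_before_s=3, lock_after_s=10):
    """Repeatedly pick the frame with the lowest MSE score, emit a
    (start, end) frame window around it, and lock out nearby frames
    so any further pick comes from a distant part of the video.

    `threshold` (hypothetical) accepts additional windows only while
    the best remaining score stays at or below it.
    """
    scores = list(scores)
    n = len(scores)
    blocked = [False] * n
    windows = []
    while True:
        candidates = [i for i in range(n) if not blocked[i]]
        if not candidates:
            break
        idx = min(candidates, key=scores.__getitem__)
        # Always keep the global best; further windows need a good score.
        if windows and (threshold is None or scores[idx] > threshold):
            break
        windows.append((max(0, idx - before_s * fps),
                        min(n - 1, idx + after_s * fps)))
        # Lockout interval: 3 s before and 10 s after the index frame.
        for j in range(max(0, idx - lock_before_s * fps),
                       min(n, idx + lock_after_s * fps + 1)):
            blocked[j] = True
    return windows
```

This mechanism also explains how a single screenshot can yield more than one clip when a similar view recurs in a distant frame sequence.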

Quality control
A web portal was designed to allow expert urology reviewers to securely assess the content quality of the 10-s video clips using two-factor authentication. For each video clip, the reviewer obtained information about the original screenshot using the QR barcode [24] and verified whether the lesion seen in the screenshot was present in the video clip. For quality control, the reviewer checked the video content for the presentation of the lesion from different angles and for correct lesion selection. Finally, data completeness for pathology information in the barcodes of all video clips was assessed.

Fig. 2 Hierarchical file storage for screenshots and videos. PHI is blacked out. A root folder for each case was named with the case identification number and the date of transurethral resection of bladder tumors. Each case folder contains one folder, "Images", to store screenshots and one folder, "Video(s)", to keep video records. The pathology and surgery reports are optionally stored in the case folder. The required information was extracted from the filenames of the screenshots using the keyword search

Application
As an application of our framework, we examined the feasibility of creating a cystoscopy atlas by presenting the video clips arranged by pathological diagnosis. To protect PHI and preserve HIPAA compliance, the QR barcode containing the source file information was replaced with a QR barcode including only the pathology information.
Additionally, we examined the feasibility of using the video clips to store annotation information and to reveal the label distribution for each frame in these video clips. Each 10-s video clip was manually annotated to denote the range of frames meeting the description of the keywords considered in the current study. The frames were annotated according to the keywords described in Table 1 using a customized annotation tool (https://github.com/oeminaga/VideoAnnotator). Any frame that did not meet the description of any keyword was considered background. The annotation data were then stored in JSON format in the barcode of each video clip, in addition to the pathology information.
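The JSON payload embedded in the barcode can be illustrated with a minimal sketch. The `StartTime`/`EndTime` attribute names (in seconds) follow the description given later in the text; the remaining keys and structure are assumptions:

```python
import json

# Hypothetical annotation payload; only StartTime/EndTime are named
# in the text, the other keys are illustrative assumptions.
annotation = {
    "pathology": "TaLG",
    "annotations": [
        {"label": "TaLG", "StartTime": 1.0, "EndTime": 8.5},
    ],
}

# QR codes have limited capacity (roughly 3 KB of binary data at the
# lowest error-correction level), so compact separators keep the
# serialized payload small.
payload = json.dumps(annotation, separators=(",", ":"))
decoded = json.loads(payload)
```

Because the payload travels inside the clip itself, no external metadata sheet is needed to interpret the annotations.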

Results
Our framework was successfully implemented without exposing PHI and in compliance with HIPAA regulations. A total of 238 screenshots were available from 86 cystoscopy procedures on 78 patients; the full-length videos were processed (i.e., searched for frames close to the screenshots) at a speed of 43 ± 20 frames per second (FPS), depending on the number of screenshots to process per video. For instance, a 51-min raw video at 240p and 30 FPS containing 2 lesions (screenshots) took 62.14 min to process at a search speed of 24.6 FPS.
Our framework extracted video clips from the screenshots with an error rate of 3.3% (8/238 screenshots). Through our QC process, 14 video clips derived from 8 screenshots were excluded for not accurately reflecting the content of the screenshot. In-depth review of these 14 clips demonstrated that they originated from screenshots labeled with the keyword 'Ta', regardless of grade, or 'nobiopsy'. These screenshots were dark images taken under blue light cystoscopy, which has a darker background, or blurry images taken with the cystoscope positioned too close to the lesion. The erroneous video clips for dark images occurred when a dark screenshot had an MSE similar to that of a video frame recorded with the cystoscope located outside the patient's body. Blurry screenshots tended to have an MSE similar to that of other frames in which the camera was out of focus.
The resulting 249 video clips from 230 screenshots covering 143 lesions represent a wide range of cancerous and benign lesions of the bladder (Table 2). An additional 19 video clips were extracted because distant frame sequences also contained frames similar to the corresponding screenshots. All 230 screenshots were identified within their corresponding videos, thereby achieving a data completeness of 100% for lesion pathology in these videos.
We associated the label information with the corresponding video clips and built the web portal for quality control, as shown in Fig. 3 and Supplementary file 1. The internal web portal provided a web link to each clip for external sharing via email or barcode.
For all video clips, we labeled the frame sequences containing the lesion, and the information integrated into the videos was sufficient to curate a cystoscopic atlas. Figure 4 shows the distribution of cystoscopy frames containing a lesion of interest (LOI) versus background over the frames of the video clips, demonstrating that the majority of the labeled frames contain the LOI.
Moreover, we were able to use the video clips as storage for annotation data using the barcode in the placeholder, as shown in Fig. 5, so that each video clip incorporated the images and the corresponding image labels ready for data distribution (an example video clip with a barcode containing the annotation data is provided in the supplementary material).
Using our framework to store the lesion content of a cystoscopy video as short video clips reduced the data storage requirements by 99.5%. The full-length raw videos (cumulative length: 1,821 min) totaled 47.3 GB.
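From the reported numbers, the implied total size of the short clips can be back-calculated, assuming the 99.5% figure is computed as 1 − clips/full:

```python
full_gb = 47.3       # cumulative size of the full-length raw videos (reported)
reduction = 0.995    # reported storage reduction

# Implied cumulative size of the short video clips (an inference from
# the reported figures, not a value stated in the text).
clips_gb = full_gb * (1 - reduction)
```

This places the clip library at roughly a quarter of a gigabyte, small enough for routine cloud storage and sharing.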

Discussion
This study illustrates a semi-automated framework to improve the visual documentation of cystoscopic findings. We showed that screenshots of bladder lesions routinely taken by surgeons can be used by our framework to automatically generate concise video clips of pertinent cystoscopic findings.
Reducing data storage requirements becomes even more of a concern as cystoscopic videos are recorded at increasingly higher resolution; a 1080p video is approximately 4 times the size of a 480p video. More importantly, this framework allows the surgeon to review cystoscopic videos in a shorter amount of time by focusing the data on the lesions of interest. Cystoscopic videos often contain a large amount of clinically irrelevant information (e.g., time when the instrument is out of the body for cleaning or equipment changes) or negative cystoscopic findings (e.g., no lesion of interest in the frame). This framework can help integrate video data, which is typically cumbersome to navigate, into the routine clinical workflow to augment the amount of cystoscopic data at hand or to support physician-patient communication.

The current framework of cystoscopy documentation is text-based (within the surgery or procedure report and the pathology report). However, textual description is limited and influenced by the perception bias of the surgeon, as supported by previous studies on the limits of human perception in describing image content [25,26]. Additionally, it has been shown that non-visual categorical word descriptions of object shape (e.g., 'papillary', 'sessile') influence how a person visually processes the visual medium [25,27].
Accordingly, to reduce the potential bias on visual processing, we labeled the video content with objective information verified by pathology (e.g., pathologic grade and stage) or with obvious visual appearances (e.g., blood clot), while we omitted subjective descriptors of lesion appearance (e.g., papillary, flat, texture descriptions) from the video labeling.
The proposed framework further facilitates data decentralization and preservation of data integrity for documentation and data sharing by storing the label information (Table 1) generated with the customized annotation tool.

Fig. 4 Distribution of labels over the frames of the video clips. Background denotes frames not meeting any of the categories of Table 1. One second corresponds to 30 frames. The change in the background distribution over the frames indicates the variation of video content over time, reflecting the dynamics of the camera view during the 10 s

Fig. 5 Representative examples of A lesions representing the major categories of bladder cancer (TaHG, T1HG, benign, TaLG, Tis) and B barcodes including the corresponding annotation data for each lesion generated by the framework. The barcodes are also displayed in the first second of each video clip to facilitate decentralized data annotation for data sharing; each barcode stores the time interval of lesion appearance in the short video clip (attributes: StartTime and EndTime; time unit: seconds). The camera function of a smartphone can be used to read the annotation content of the barcode. The annotation data are stored in JSON format and generated by our custom annotation tool (source available at https://github.com/oeminaga/VideoAnnotator). The supplementary section provides an example video clip that includes the annotation data as a QR code in the barcode

Studies in the medical imaging domain generally use metadata sheets to label and associate annotation information with the corresponding visual content. However, this approach carries a higher risk of losing data integrity when the metadata sheets are no longer available. In contrast, our solution manages each data point (video clip) as an independent storage unit, supporting decentralized data management and eliminating the need for metadata sheets to utilize a data point.
Decentralized data facilitate better framework scalability and data distribution compared with rigid, centralized data collection, and can be an integral part of a decentralized data solution [28].
Prior studies of deep learning for cystoscopy utilized still images, either captured directly by surgeons or extracted from cystoscopy videos, to curate a cystoscopy atlas [6][7][8][9][10][11]. In these studies, the collection of still cystoscopy images followed predefined frame selection criteria (e.g., a single frame with the best representation of the lesion, every 10th frame, the first 30 frames) to reduce data dimensionality and redundancy; however, this approach treats cystoscopy data as static, which is not reflective of real-world conditions and the dynamic streaming content, and therefore limits the ability to generate generalizable and relevant data.
There are certain limitations that should be considered. First, the implementation of the framework depends on the existing technical infrastructure, the workflow for data management, and the availability of videos and screenshots, and therefore must be adjusted to the existing infrastructure. We used retrospective data that may reflect the limitations of any retrospective study [29], including selection bias. Nevertheless, to assess generalization, we examined our framework on screenshots representing a rare histology type of bladder cancer, the putative precursor of bladder cancer (dysplasia), and other clinical situations (i.e., the presence of clot in the bladder, lesions not biopsied). Our work represents a proof-of-concept study; the framework concept and the tools developed as part of this study are prototypes and therefore subject to continuous improvement. Finally, we chose the MP4 format for the video clips, as it can run on all multimedia players and operating systems, whereas medical imaging systems use the DICOM format as their standard. Nevertheless, it is simple to convert video clips to DICOM format [30,31].
Overall, our work provides insight into the challenges of video-based clinical documentation and proposes a solution to manage the visual documentation of cystoscopy. Our approach can be used to curate cystoscopic atlases and datasets for research and clinical documentation. Future work will focus on developing a video player tool that reads the annotation data from the QR barcode and overlays it during clip presentation, and on merging artificial intelligence with the proposed documentation framework to improve the overall documentation workflow for cystoscopy. Finally, we will prospectively explore artificial intelligence-based documentation solutions for cystoscopy to automate the generation of short videos and annotation data.

Conclusion
The current work introduces and validates, as a proof of concept, a framework for surgeon-instructed video documentation of cystoscopy that maintains spatiotemporal information and supports decentralized data management.