Renal/Lung/Breast Cancer Region Detection and Subtyping Dataset

Complete Region Annotation

Areas inside the red lines are cancerous regions; areas outside them are non-cancerous.

Minimal Point Based Annotation

A red point marks a position in a cancerous region

A green point marks a position in a non-cancerous region

The RCC dataset is derived from the TCGA database. It covers three TCGA projects (KIRC, KIRP, and KICH) and contains 667 WSIs in total.

The LU dataset is derived from the TCGA database. It covers two TCGA projects (LUAD and LUSC) and contains 890 WSIs in total.

The BR dataset is derived from the TCGA database. It covers two TCGA projects (BRDC and BRLC) and contains 1000 WSIs in total.

  • These WSIs are scanned at 40x magnification and selected by experienced pathologists.

  • Each WSI is annotated with two methods: Minimal Point-Based (Min-Point) annotation and Complete Region annotation.

  • The datasets provided here are for research purposes only. Commercial use is not allowed.

  • If you intend to publish a research work that uses any of these datasets, you must cite our publication.

Minimal Point-Based Annotation Rules

  1. Mark the same number of points in cancerous and non-cancerous regions. We set this number.

  2. Evenly distribute the points within the whole image.

  3. Do not mark points in blank, edge, badly stained, damaged (man-made), or otherwise abnormal areas.


Statistics

Annotation Time

Min-Point annotation reduces the annotation time to roughly one-twentieth of that required for complete region annotation.

Details

Three large cancer classification datasets, each with two types of annotation. The test set is composed of two parts: a patch-level test set for cancer region detection and a WSI-level test set for subtyping.

Papers

Renal Cell Carcinoma Detection and Subtyping with Minimal Point-Based Annotation in Whole-Slide Images

Zeyu Gao, Pargorn Puttapirat, Jiangbo Shi, and Chen Li

A Semi-Supervised Multi-Task Learning Framework for Cancer Classification with Weak Annotation in Whole-Slide Image

Zeyu Gao, Bangyang Hong, Yang Li, Xianli Zhang, Jialun Wu, Chunbao Wang, Xiangrong Zhang, Tieliang Gong, Yefeng Zheng, Deyu Meng, and Chen Li

Medical Image Analysis, 2022.

Applications

Cancer Region Detection Results (Heat Maps)

Subtyping Results - Blue (ccRCC), Red (chRCC), Green (pRCC)

Data Format

These three datasets are composed of two parts:

  1. Complete Region Annotation: saved as PNG files. Regions marked in white, red, or green are cancer regions. Regions marked in blue are abandoned regions, which can also be treated as background.

    • The PNG file name is the UUID of each WSI. Use "slide_list.txt" to match it to the original TCGA name.

    • Each annotation mask has the same size as level 3 of the corresponding WSI.

  2. Min-Point Annotation: saved as a TXT file with four columns (wsi, x, y, and label), giving the TCGA name of the WSI, the x and y coordinates of each point, and the label of each point.

    • For the label, 0 and 1 represent the "cancer" and "normal" class, respectively.

    • The x and y coordinates of each point are given at level 0 of the corresponding WSI.
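As a minimal sketch of how the two annotation types relate, the Python below parses Min-Point rows and maps a level-0 point onto the grid of the level-3 mask. The column delimiter (whitespace or commas), the possible presence of a header row, and the downsample factor are assumptions rather than facts stated on this page, and the names `Point`, `parse_min_point_lines`, and `level0_to_mask` are hypothetical. With OpenSlide, the actual factor for a given WSI can be read from `slide.level_downsamples[3]`.

```python
from typing import List, NamedTuple, Tuple


class Point(NamedTuple):
    wsi: str    # TCGA name of the WSI
    x: int      # x coordinate at level 0
    y: int      # y coordinate at level 0
    label: int  # 0 = cancer, 1 = normal


def parse_min_point_lines(lines: List[str]) -> List[Point]:
    """Parse rows with four fields: wsi, x, y, label.

    Assumption: fields are separated by whitespace or commas.
    """
    points = []
    for line in lines:
        fields = line.replace(",", " ").split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        wsi, x, y, label = fields
        try:
            points.append(Point(wsi, int(float(x)), int(float(y)), int(label)))
        except ValueError:
            continue  # skip a non-numeric row such as a header
    return points


def level0_to_mask(x: int, y: int, downsample: float) -> Tuple[int, int]:
    """Map a level-0 coordinate to the pixel grid of a downsampled mask."""
    return int(x / downsample), int(y / downsample)
```

For illustration, with an assumed downsample factor of 16, a level-0 point at (1600, 3200) falls on mask pixel (100, 200). The real factor should always be taken from the WSI itself.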

Note that only annotation data is available here. The original WSIs (SVS files) need to be downloaded from the TCGA portal, and the labeled and unlabeled image patches need to be cropped from the original WSIs based on the annotation data.
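One possible way to crop a patch around an annotated point is sketched below. The page does not specify the patch size or whether patches are centered on the points, so both are assumptions, and `patch_origin` is a hypothetical helper. It returns the top-left corner of a square window centered on a level-0 point and clamped to the slide bounds.

```python
from typing import Tuple


def patch_origin(x: int, y: int, patch_size: int,
                 slide_width: int, slide_height: int) -> Tuple[int, int]:
    """Top-left corner of a patch_size square centered on (x, y).

    The window is shifted inward when it would cross the slide border.
    """
    half = patch_size // 2
    left = min(max(x - half, 0), max(slide_width - patch_size, 0))
    top = min(max(y - half, 0), max(slide_height - patch_size, 0))
    return left, top
```

With OpenSlide, `slide.dimensions` gives the level-0 width and height, and `slide.read_region((left, top), 0, (patch_size, patch_size))` reads the corresponding patch as an RGBA image.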

RCC dataset

Breast dataset

Lung dataset