KBSMC colon cancer grading dataset repository

This repository provides the KBSMC colon cancer grading dataset that has been introduced in the paper:

Trinh Thi Le Vuong, Kyungeun Kim, Boram Song, Jin Tae Kwak,

"Joint Categorical and Ordinal Learning for Cancer Grading in Pathology Images",

which is published in Medical Image Analysis, 2021, 73. (Code)

Download

Google drive:
Training + validation + testing I (4.35 GB): [link]
Training + validation + testing I (Resize to 512) (1.55 GB): [link]
Testing II (Independent test set) (64.71 GB): [link]

Brief description

The tissue images and annotations are provided by Kangbuk Samsung Hospital, Seoul, South Korea.
Two pathologists have delineated the annotation: Kim, Kyungeun, and Song, Boram.
Herein, we obtained the benign (BN) and three cancer ROIs, including well-differentiated (WD) tumor, moderately-differentiated (MD) tumor, and poorly-differentiated (PD) tumor.

The train+val+test sets contain 3 whole slide images (WSIs) and 6 colorectal tissue microarrays (TMAs) from 340 patients that were scanned at 40x magnification using an Aperio digital slide scanner (Leica Biosystems).
The Test_set_2 (Independent test set) includes 45 WSIs from 45 patients scanned using a NanoZoomer digital slide scanner (Hamamatsu Photonics K.K.), at 40x magnification.
The patch images are generated at 40x of size (~258 µm x 258 µm), then resize to 512x512 pixels (20x). For more detail, please refer to the paper above.

Dataset detail

Status	Training	Validation	Testing I	Testing II
Benign	773	374	453	27986
WD	1866	264	192	8394
MD	2997	370	738	61985
PD	1397	234	205	11895

Training + validation + testing I structure

├── KBSMC_colon_tma_cancer_grading_512
│ ├── tma01
│ │ ├── 100100_19493_0.jpg
│ │ ├── 100221_64120_3.jpg
│ │ ├── 100383_42563_2.jpg
│ │ ├── 100443_52718_2.jpg
│ │ ├── 100634_14705_1.jpg
│ │ ├── ...
│ ├── tma02
│ ├── tma03
│ ├── tma04
│ ├── tma05
│ ├── tma06
│ ├── wsi01
│ ├── wsi02
│ └── wsi03

Notes:

Train+Val+Test 1 Class label is the last digit in image name (bolden), for example 100100_19493_0.jpg belongs to benign class.

Test 2 Class label is the (last digit-1) in image name (bolden), for example 100100_19493_1.jpg belongs to benign class.

Simple way to load the dataset

Check out the dataset.py

def prepare_colon_tma_data(data_root_dir='./KBSMC_colon_tma_cancer_grading_512'):
    """ List all the images and their labels 
        return Training + validation + testing I
    """
    def load_data_info(pathname):
        file_list = glob.glob(pathname)
        label_list = [int(file_path.split('_')[-1].split('.')[0]) for file_path in file_list]
        print(Counter(label_list))
        return list(zip(file_list, label_list))

    set_tma01 = load_data_info('%s/tma01/*.jpg' % data_root_dir)
    set_tma02 = load_data_info('%s/tma02/*.jpg' % data_root_dir)
    set_tma03 = load_data_info('%s/tma03/*.jpg' % data_root_dir)
    set_tma04 = load_data_info('%s/tma04/*.jpg' % data_root_dir)
    set_tma05 = load_data_info('%s/tma05/*.jpg' % data_root_dir)
    set_tma06 = load_data_info('%s/tma06/*.jpg' % data_root_dir)
    set_wsi01 = load_data_info('%s/wsi01/*.jpg' % data_root_dir)  # benign exclusively
    set_wsi02 = load_data_info('%s/wsi02/*.jpg' % data_root_dir)  # benign exclusively
    set_wsi03 = load_data_info('%s/wsi03/*.jpg' % data_root_dir)  # benign exclusively

    train_set = set_tma01 + set_tma02 + set_tma03 + set_tma05 + set_wsi01
    valid_set = set_tma06 + set_wsi03
    test_set = set_tma04 + set_wsi02

    return train_set, valid_set, test_set

def prepare_colon_wsi_data(data_root_dir='./KBSMC_colon_45wsis_cancer_grading_512 (Test 2)'):
        """ List all the images and their labels 
        return test_set 2
    """
    def load_data_info_from_list(data_dir, path_list):
        file_list = []
        for WSI_name in path_list:
            pathname = glob.glob(f'{data_dir}/{WSI_name}/*/*.png')
            file_list.extend(pathname)
            label_list = [int(file_path.split('_')[-1].split('.')[0]) -1 for file_path in file_list]
        print(Counter(label_list))
        list_out = list(zip(file_list, label_list))
        return list_out

    wsi_list = ['wsi_001', 'wsi_002', 'wsi_003', 'wsi_004', 'wsi_005', 'wsi_006', 'wsi_007', 'wsi_008', 'wsi_009',
                'wsi_010', 'wsi_011', 'wsi_012', 'wsi_013', 'wsi_014', 'wsi_015', 'wsi_016', 'wsi_017', 'wsi_018',
                'wsi_019', 'wsi_020', 'wsi_021', 'wsi_022', 'wsi_023', 'wsi_024', 'wsi_025', 'wsi_026', 'wsi_027',
                'wsi_028', 'wsi_029', 'wsi_030', 'wsi_031', 'wsi_032', 'wsi_033', 'wsi_034', 'wsi_035', 'wsi_090',
                'wsi_092', 'wsi_093', 'wsi_094', 'wsi_095', 'wsi_096', 'wsi_097', 'wsi_098', 'wsi_099', 'wsi_100']

    test_set = load_data_info_from_list(data_root_dir, wsi_list)
    return test_set


class DatasetSerial(data.Dataset):
    """get image by index
    """
    def __init__(self, pair_list, img_transform=None, target_transform=None, two_crop=False):
        self.pair_list = pair_list

        self.img_transform = img_transform
        self.target_transform = target_transform
        self.num = self.__len__()

    def __getitem__(self, index):
        """
        Args:
            index (int): index
        Returns:
            tuple: (image, index, ...)
        """
        path, target = self.pair_list[index]
        image = pil_loader(path)

        # # image
        if self.img_transform is not None:
            img = self.img_transform(image)
        else:
            img = image

        return img, target

Citation

If any part of this dataset is used, please give appropriate citation to our paper.

@article{le2021joint,
  title={Joint categorical and ordinal learning for cancer grading in pathology images},
  author={Le Vuong, Trinh Thi and Kim, Kyungeun and Song, Boram and Kwak, Jin Tae},
  journal={Medical image analysis},
  pages={102206},
  year={2021},
  publisher={Elsevier}
}

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
LICENSE.md		LICENSE.md
README.md		README.md
colon_tma_tissue_sample.png		colon_tma_tissue_sample.png
dataset.py		dataset.py
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KBSMC colon cancer grading dataset repository

Download

Brief description

Dataset detail

Training + validation + testing I structure

Notes:

Simple way to load the dataset

Citation

About

Releases

Packages

Contributors 2

Languages

License

QuIIL/KBSMC_colon_cancer_grading_dataset

Folders and files

Latest commit

History

Repository files navigation

KBSMC colon cancer grading dataset repository

Download

Brief description

Dataset detail

Training + validation + testing I structure

Notes:

Simple way to load the dataset

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages