Commit

Update README.md
zmgong authored Jul 28, 2024
1 parent 9c7ccda commit 9ed7ea4
Showing 1 changed file with 0 additions and 38 deletions.
38 changes: 0 additions & 38 deletions README.md
@@ -93,44 +93,6 @@ TODO: add the command for downloading the images and generating the hdf5 file.

You can check [BIOSCAN-1M](https://github.com/zahrag/BIOSCAN-1M) and [BIOSCAN-5M](https://github.com/zahrag/BIOSCAN-5M) to download tsv files.

## Data Structure in HDF5 Format

The data is stored in HDF5 format with the following structure. Each HDF5 file contains multiple groups, each representing a different split of the data.


### Group Structure
Each group represents a specific data split and contains several datasets. The groups are organized as follows:

- `all_keys`: Contains all data that will be used as keys during evaluation.
- `val_seen`: Contains seen query data for validation.
- `test_seen`: Contains seen query data for testing.
- `seen_keys`: Contains seen data that will be used as keys during evaluation. Note that for BIOSCAN-5M, these data are also used for training.
- `test_unseen`: Contains unseen test data.
- `val_unseen`: Contains unseen validation data.
- `unseen_keys`: Contains unseen data that will be used as keys during evaluation.
- `no_split_and_seen_train`: Contains all data that will be used for contrastive pretraining.

The group structure differs slightly between the BIOSCAN-1M and BIOSCAN-5M data, but the overall layout is the same.
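Because the exact set of groups can differ between the two datasets, it may help to inspect a file before relying on a particular split name. The following is a minimal sketch, not part of the repository: the helper names `list_split_contents` and `print_split_contents` are hypothetical, and the path to the HDF5 file is left to the caller. It assumes only that the file's top-level entries are groups, as described above.

```python
def list_split_contents(root):
    """Map each top-level group (split) name to the sorted names of its datasets.

    `root` can be an open h5py.File or any dict-like object with the same layout.
    Top-level entries that are not groups (no `keys` method) are skipped.
    """
    contents = {}
    for split_name, group in root.items():
        if hasattr(group, "keys"):
            contents[split_name] = sorted(group.keys())
    return contents


def print_split_contents(hdf5_path):
    """Open an HDF5 file read-only and print the datasets found in each split."""
    import h5py  # third-party dependency: pip install h5py

    with h5py.File(hdf5_path, "r") as f:
        for split_name, dataset_names in list_split_contents(f).items():
            print(f"{split_name}: {', '.join(dataset_names)}")
```

Calling `print_split_contents` with the path to your own file shows which of the splits listed above are actually present in it.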

### Dataset Structure

Each group contains several datasets:

- `image`: Stores the image data as byte arrays.
- `image_mask`: Stores the length of each image byte array.
- `barcode`: Stores DNA barcode sequences.
- `family`: Stores the family classification of each sample.
- `genus`: Stores the genus classification of each sample.
- `order`: Stores the order classification of each sample.
- `sampleid`: Stores the sample IDs.
- `species`: Stores the species classification of each sample.
- `processid`: Stores the process IDs for each sample.
- `language_tokens_attention_mask`: Stores the attention masks for language tokens.
- `language_tokens_input_ids`: Stores the input IDs for language tokens.
- `language_tokens_token_type_ids`: Stores the token type IDs for language tokens.
- `image_file`: Stores the filenames of the images.
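As a rough illustration of how these datasets fit together, the sketch below reads a single sample from one group. It rests on assumptions the list above does not confirm: that each row of `image` is a `uint8` array padded to a fixed width, that `image_mask` gives the number of valid leading bytes in that row, and that the text fields are stored as byte strings. The helper names (`trim_image_bytes`, `read_sample`) are hypothetical and not part of the repository.

```python
# Text datasets read back for each sample (names taken from the list above).
TEXT_FIELDS = (
    "barcode", "order", "family", "genus", "species",
    "processid", "sampleid", "image_file",
)


def trim_image_bytes(padded_row, valid_length):
    """Drop padding from one `image` row, keeping the first `valid_length` bytes.

    Assumes `padded_row` holds one byte per element (uint8 values).
    """
    return bytes(padded_row[: int(valid_length)])


def _as_text(value):
    """Decode byte strings to str; leave any other value unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def read_sample(group, index):
    """Collect the image bytes and text fields for the sample at `index`.

    `group` can be an h5py group (for example `f["val_seen"]`) or any
    dict-like object with the same dataset names.
    """
    sample = {
        "image_bytes": trim_image_bytes(
            group["image"][index], group["image_mask"][index]
        )
    }
    for name in TEXT_FIELDS:
        sample[name] = _as_text(group[name][index])
    return sample
```

If the bytes are an encoded image file, they can typically be opened with Pillow via `Image.open(io.BytesIO(sample["image_bytes"]))`; whether that holds for these files is an assumption to verify against your own data.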


# Running experiments
We recommend using [Weights & Biases](https://wandb.ai/site) to track and log experiments.
