Train an ML model on a dataset

In the previous tutorial, we loaded an entire dataset into memory to perform a simple analysis.

Here, we’ll iterate over the files within the dataset to train an ML model.

import lamindb as ln
import anndata as ad
import numpy as np

💡 lamindb instance: testuser1/test-scrna

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.63.2 numpy==1.26.2 torch==2.1.1

💡 saved: Transform(uid='Qr1kIHvK506rz8', name='Train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-12-04 19:36:00 UTC, created_by_id=1)

💡 saved: Run(uid='yWf1S2Y626KGsA0YQ2H0', run_at=2023-12-04 19:36:00 UTC, transform_id=5, created_by_id=1)

Preprocessing

Let us get our dataset:

dataset_v2 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()
dataset_v2

Dataset(uid='JaRPSpeonm0h8TK35Qeo', name='My versioned scRNA-seq dataset', version='2', hash='BOAf0T5UbN_iOe3fQDyq', visibility=1, updated_at=2023-12-04 19:35:38 UTC, transform_id=2, run_id=2, initial_version_id=1, created_by_id=1)

We’ll need to make a decision on the features that we want to use for training the model.

Because every file is validated, all of them are indexed by ensembl_gene_id in the var slot of AnnData.

To make our lives easy, we’ll intersect features across files:

files = dataset_v2.files.all()
# the gene sets are stored in the "var" slot of features
shared_genes = files[0].features["var"]
for file in files[1:]:
    # QuerySet objects allow set operations
    shared_genes = shared_genes & file.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
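
The intersection step can be illustrated without a database. The following is a minimal sketch, assuming each file's gene set is a plain Python set of Ensembl gene IDs; the `gene_sets` variable and the IDs in it are hypothetical stand-ins for illustration and are not part of the lamindb API.

```python
from functools import reduce

# Hypothetical gene sets, one per file (illustrative Ensembl gene IDs).
gene_sets = [
    {"ENSG00000141510", "ENSG00000012048", "ENSG00000139618"},
    {"ENSG00000141510", "ENSG00000139618", "ENSG00000171862"},
    {"ENSG00000139618", "ENSG00000141510"},
]

# Intersect the sets one after another, mirroring the loop over files.
shared = reduce(lambda a, b: a & b, gene_sets)

# Sort so that every file is later subset in the same feature order.
shared_sorted = sorted(shared)
print(shared_sorted)
```

Only genes present in every file survive, so every file can be subset to the same columns before training.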

We’ll now store the raw representations and create a training dataset:

raw_files = []
for file in files:
    adata_raw = file.load().raw[:, shared_genes_ensembl].to_adata()
    raw_file = ln.File(adata_raw, description=f"Raw data of file {file.uid}")
    raw_files.append(raw_file)
ln.save(raw_files)

ds_train = ln.Dataset(raw_files, name="My training dataset", version="2")
ds_train.save()
ds_train.view_flow()

PyTorch DataLoader

If you need to train your model on a list of files, you can use mapped() with the PyTorch DataLoader.

It loads only batches into memory and thus lets you work with very large datasets.

from torch.utils.data import DataLoader, WeightedRandomSampler

Files in the dataset should have the same variables; we have already taken care of this by subsetting every file to the shared genes.

ds_mapped = ds_train.mapped(label_keys=["cell_type"])

This is compatible with the PyTorch DataLoader because it implements __getitem__ over a list of AnnData files.

ds_mapped[5]


[array([ 0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0., 17.,  0.,  0.,  0.,  2.,  0.,  0.,  2.,  1.,
         0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
         0.,  0.,  0.,  0.,  0.,  4.,  0.,  2.,  3.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
         3.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         1.,  0.,  3.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
         2.,  0.,  0.,  5.,  6.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,
         0.,  0.,  0.,  1.,  1.,  1.,  1.,  3.,  0.,  0.,  4.,  1.,  3.,
         0.,  0.,  0.,  0.,  0.,  2.,  0.,  2.,  1.,  0.,  0.,  1.,  0.,
         0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  2.,  0.,  1.,
         0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         5.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
         0.,  2.,  1.,  0.,  0.,  1.,  3.,  4.,  1.,  0.,  2.,  1.,  1.,
         1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  2.,  0.,  0.,  3.,
         0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 96.,  0.,  6.,
         1.,  2.,  0.,  0.,  1.,  0.,  6.,  1.,  0.,  0.,  1.,  0.,  2.,
         0.,  3.,  0.,  0.,  2., 10.,  0.,  0.,  0.,  5.,  1., 26.,  2.,
        14.,  6.,  5.,  0.,  3.,  0.,  8., 10.,  0.,  1.,  0.,  1.,  0.,
         1.,  1.,  0.,  5.,  1.,  0.,  3.,  1.,  1.,  1.,  0.,  0.,  0.,
         2.,  1.,  3.,  0.,  0.,  1.,  3.,  3.,  0.,  0.,  2.,  0.,  1.,
         0.,  4.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  7.,  1.,  0.,
         0.,  0.,  0.,  0.,  0.,  2.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  4.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,
         0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  2.,  0.,  0.,  0.,
         0.,  0.,  0.,  2.,  0.,  1.,  2.,  0.,  1.,  0.,  2.,  0.,  1.,
         0.,  1.,  0.,  0.,  1.,  0.,  0.,  0., 17.,  0.,  0.,  0.,  0.,
         4.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         2.,  1., 13.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,
         0.,  0.,  0.,  2.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  2.,  2.,  1.,  0.,  0.,  0.,  6.,  0.,  2.,  0.,  0.,
         0.,  0.,  1.,  1.,  0.,  1.,  0.,  2.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  1.,  0.,  1.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  2.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  4.,  0.,  6.,  1.,  0.,  0.,  0.,  2.,  0.,  2.,
         1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  7.,  1.,  0.,
         0.,  0.,  0.,  1., 10.,  1.,  6.,  0.,  0.,  1.,  4.,  0.,  0.,
         0.,  0.,  0.,  2.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
         0.,  0.,  0.,  0.,  0.,  0.,  1.,  2.,  0.,  0.,  0.,  0.,  1.,
         0.,  0.,  0.,  0.,  1.,  2.,  1.,  0.,  0.,  0.,  4.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  3.,  0.,  0.,  0.,  0.,
         0.,  0.,  2.,  0.,  0.,  0.,  3.,  0.,  0.,  3.,  0.,  0.,  0.,
         0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,  0.,  2.,
         0.,  6.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  2.,  0.,  0.,  0.,  1.,
         1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  2.,
         0.,  0.,  0.,  0.,  0.,  1.,  3.,  4.,  0.,  0.,  0.,  1.,  0.,
        10.,  0.,  1.,  0.,  1.,  1.,  0.,  3.,  0.,  0.,  0.,  0.,  0.,
         1.,  0.,  0.,  4.,  0.,  0.,  0.,  0.,  0., 15.,  0.,  6.,  0.,
         1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
         0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  2.,  3.,  0.,  6.,
         1.,  1.,  0.,  1.,  1.,  0.,  2.,  0.,  1.,  0.,  0.,  0.,  1.,
         1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
         0.,  4.,  0.,  1.,  1.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  2.,
         1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  2.,  0.,  1.,  0.,
         0.,  0., 53.,  1.,  0.,  1.,  0., 35.]),
 15]
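
To show how a single integer index can address observations spread over several files, here is a minimal sketch of a map-style dataset. It assumes in-memory lists as stand-ins for files; the class `ConcatenatedRows` and its attributes are hypothetical and do not reflect how lamindb implements mapped().

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatenatedRows:
    """Map-style dataset over several in-memory tables (stand-ins for files)."""

    def __init__(self, tables, labels):
        # tables: one list of rows per file; labels: one list of integer labels per file
        self.tables = tables
        self.labels = labels
        # Cumulative row counts, e.g. table sizes [2, 3] give offsets [2, 5].
        self.offsets = list(accumulate(len(t) for t in tables))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the table that holds the global index, then the row inside it.
        table_idx = bisect_right(self.offsets, idx)
        start = self.offsets[table_idx - 1] if table_idx > 0 else 0
        local_idx = idx - start
        return [self.tables[table_idx][local_idx], self.labels[table_idx][local_idx]]


tables = [[[0.0, 1.0], [2.0, 0.0]], [[3.0, 3.0], [0.0, 0.0], [1.0, 5.0]]]
labels = [[0, 1], [1, 0, 2]]
ds = ConcatenatedRows(tables, labels)
print(len(ds), ds[3])
```

Any object that provides __len__ and __getitem__ in this way can be passed to a PyTorch DataLoader, which then requests individual indices and collates them into batches.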

The labels are encoded as integers; the encoders attribute holds the mapping:

ds_mapped.encoders
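
The idea behind such an encoder can be sketched in a few lines. This is an illustration under assumptions: the labels are made up, and assigning codes in sorted order is a choice made here, not necessarily the order lamindb uses.

```python
# Hypothetical cell type labels for six cells.
cell_types = ["T cell", "B cell", "T cell", "monocyte", "B cell", "T cell"]

# Assign each distinct label an integer code in sorted order
# (an assumption; a real encoder may order categories differently).
encoder = {label: code for code, label in enumerate(sorted(set(cell_types)))}
encoded = [encoder[label] for label in cell_types]

# Invert the mapping to decode integer predictions back to label names.
decoder = {code: label for label, code in encoder.items()}
print(encoder, encoded)
```

Keeping the inverse mapping around is useful after training, when integer predictions need to be translated back into cell type names.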

Let us use a weighted sampler:

# the label key used for weighting doesn't have to be among the label_keys passed to mapped()
sampler = WeightedRandomSampler(
    weights=ds_mapped.get_label_weights("cell_type"), num_samples=len(ds_mapped)
)
dl = DataLoader(ds_mapped, batch_size=128, sampler=sampler)
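
To see why a weighted sampler helps with imbalanced labels, consider the following sketch. It assumes inverse-frequency weights, which is one common choice; whether get_label_weights computes exactly this is not confirmed here. The labels are made up, and the standard library's random.choices stands in for the sampler.

```python
import random
from collections import Counter

# Hypothetical, imbalanced labels: 6 T cells, 3 B cells, 1 monocyte.
labels = ["T cell"] * 6 + ["B cell"] * 3 + ["monocyte"] * 1

# Inverse-frequency weight per observation: rare classes get larger weights,
# so every class carries the same total weight.
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

# Draw indices with replacement, as WeightedRandomSampler does by default.
rng = random.Random(0)
indices = rng.choices(range(len(labels)), weights=weights, k=3000)
sampled = Counter(labels[i] for i in indices)
print(sampled)
```

With these weights, each class is drawn about equally often even though the monocyte occurs only once in the data.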

We can now iterate through the data loader:

for batch in dl:
    pass

Close the connections in MappedDataset:

ds_mapped.close()