Preparing AIRR files

InterClone requires AIRR-formatted input files, which are tab-delimited files with specific column headers. The following headers are required:

sequence_id, containing a unique identifier for a sequence entry

sequence_aa, containing the complete amino acid sequence

v_call, containing the assigned V gene name for chain filtering

clone_id, containing a unique identifier for each clone, used for paired result merging
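A quick way to check a file before running InterClone is to compare its header row against the four required names. The sketch below uses only the Python standard library; the function name `missing_airr_columns` is hypothetical and is not part of InterClone.

```python
import csv
import io

# The four headers listed above as required by InterClone.
REQUIRED_COLUMNS = ("sequence_id", "sequence_aa", "v_call", "clone_id")


def missing_airr_columns(handle):
    """Return the required column names absent from a tab-delimited header row.

    `handle` is any open text file or file-like object. This helper is a
    hypothetical illustration, not an InterClone function.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Example: a file that lacks the clone_id column.
example = io.StringIO(
    "sequence_id\tsequence_aa\tv_call\n"
    "seq1\tEVQLVESGG\tIGHV3-23*01\n"
)
print(missing_airr_columns(example))  # prints ['clone_id']
```

For a file on disk, the same function can be called as `missing_airr_columns(open(path, newline=""))`.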

Input files can be prepared from raw FASTA, 10X CellRanger, Illumina MIRA and MiXCR-formatted data. An import script is provided for each of these data sources. They can be found in the src/dataimport/ folder in the source code repository.

In the case of FASTA-formatted data, the user provides one or more input files containing full-length amino acid sequences, along with the chain type of these sequences.
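As a rough illustration of the FASTA case, the sketch below reads amino acid records and produces rows with the four required columns. It is not the provided import script. Two choices here are assumptions made only for the example: the user-supplied chain type is stored in v_call, and each sequence becomes its own clone by reusing the FASTA identifier as clone_id. The function names are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_airr_rows(handle, chain_type):
    """Build one AIRR-style row per FASTA record.

    Assumptions (not confirmed by the InterClone documentation):
    the chain type is written to v_call, and clone_id reuses sequence_id.
    """
    return [
        {
            "sequence_id": name,
            "sequence_aa": sequence,
            "v_call": chain_type,
            "clone_id": name,
        }
        for name, sequence in read_fasta(handle)
    ]


fasta = io.StringIO(">ab1 example heavy chain\nEVQLVESGG\nGLVQPGGS\n>ab2\nQVQLQESG\n")
rows = fasta_to_airr_rows(fasta, chain_type="IGH")
print(rows[0]["sequence_aa"])  # prints EVQLVESGGGLVQPGGS
```

The rows can then be written as a tab-delimited file with `csv.DictWriter(..., delimiter="\t")`.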

In the case of raw MiXCR TSV outputs, AIRR files are prepared as follows:

Build the full-length amino acid sequence by concatenating the framework and CDR regions from ‘aaSeqImputedFR1’, ‘aaSeqImputedCDR1’, ‘aaSeqImputedFR2’, ‘aaSeqImputedCDR2’, ‘aaSeqImputedFR3’, ‘aaSeqImputedCDR3’ and ‘aaSeqImputedFR4’, in that order. Discard sequences containing gaps or stop codons in any of these regions.

The columns ‘cloneId’, ‘allVHitsWithScore’, ‘allJHitsWithScore’, ‘aaSeqImputedCDR3’, ‘allCHitsWithScore’, and ‘cloneCount’ from the MiXCR output file are renamed to ‘clone_id’, ‘v_call’, ‘j_call’, ‘cdr3’, ‘c_call’ and ‘clone_count’, respectively.
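The two MiXCR steps above can be sketched for a single row as follows. This is a minimal illustration, not the provided import script. It assumes that stop codons appear as `*` and gaps or incomplete codons as `_` in the MiXCR amino acid columns, and that a row with an empty region should also be discarded. The function name `mixcr_row_to_airr` is hypothetical.

```python
# Region columns in the order they are concatenated.
REGION_COLUMNS = [
    "aaSeqImputedFR1", "aaSeqImputedCDR1", "aaSeqImputedFR2",
    "aaSeqImputedCDR2", "aaSeqImputedFR3", "aaSeqImputedCDR3",
    "aaSeqImputedFR4",
]

# MiXCR column name -> AIRR column name, as listed above.
RENAMES = {
    "cloneId": "clone_id",
    "allVHitsWithScore": "v_call",
    "allJHitsWithScore": "j_call",
    "aaSeqImputedCDR3": "cdr3",
    "allCHitsWithScore": "c_call",
    "cloneCount": "clone_count",
}

# Assumption: '*' marks a stop codon and '_' a gap or incomplete codon.
INVALID_CHARACTERS = set("*_")


def mixcr_row_to_airr(row):
    """Convert one MiXCR row (a dict) to an AIRR-style dict, or None if discarded."""
    regions = [row.get(column, "") for column in REGION_COLUMNS]
    for region in regions:
        # Assumption: an empty region is treated like a gap.
        if not region or INVALID_CHARACTERS & set(region):
            return None
    airr = {new: row.get(old, "") for old, new in RENAMES.items()}
    airr["sequence_aa"] = "".join(regions)
    return airr


example = {
    "cloneId": "0", "cloneCount": "12",
    "allVHitsWithScore": "V", "allJHitsWithScore": "J", "allCHitsWithScore": "C",
    "aaSeqImputedFR1": "AA", "aaSeqImputedCDR1": "BB", "aaSeqImputedFR2": "CC",
    "aaSeqImputedCDR2": "DD", "aaSeqImputedFR3": "EE", "aaSeqImputedCDR3": "FF",
    "aaSeqImputedFR4": "GG",
}
print(mixcr_row_to_airr(example)["sequence_aa"])  # prints AABBCCDDEEFFGG
```

Rows can be read from the MiXCR TSV with `csv.DictReader(handle, delimiter="\t")` and passed through this function one at a time, skipping any that return `None`.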

In the case of 10X CellRanger output, three files are required: ‘airr_rearrangement.tsv’, ‘clonotypes.csv’ and ‘all_contig_annotations.csv’. To prepare AIRR formatted files, the following steps are taken:

Examine the quality of each clone: only clones that contain exactly one paired heavy/light or alpha/beta chain are used. Clones containing multiple chains or only a single chain are discarded. The frequency of each clone is taken from ‘clonotypes.csv’ and saved as ‘clone_count’ in the AIRR file.

Select one of the contigs that share the same clonotype and use the columns ‘cell_id’, ‘v_call’, ‘j_call’, ‘cdr3’, ‘c_call’ and amino acid sequence for the AIRR file.
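The clone quality check can be illustrated with a small filter over chain labels. This sketch is not the provided import script. It assumes the chain labels that CellRanger writes in its annotation files (IGH, IGK, IGL, TRA, TRB) and works on plain dictionaries that the caller would fill from ‘all_contig_annotations.csv’ and ‘clonotypes.csv’. The function names are hypothetical.

```python
from collections import Counter

# Accepted pairings: one heavy with one light chain, or one beta with one alpha.
PAIRINGS = [
    ({"IGH"}, {"IGK", "IGL"}),
    ({"TRB"}, {"TRA"}),
]


def is_single_pair(chains):
    """Return True if `chains` holds exactly one accepted pair and nothing else."""
    counts = Counter(chains)
    if sum(counts.values()) != 2:
        return False  # single chains and clones with extra chains are discarded
    for first, second in PAIRINGS:
        n_first = sum(counts[chain] for chain in first)
        n_second = sum(counts[chain] for chain in second)
        if n_first == 1 and n_second == 1:
            return True
    return False


def select_clones(clone_chains, frequencies):
    """Map each accepted clone ID to its clone_count.

    clone_chains: clone ID -> list of chain labels found for that clone.
    frequencies:  clone ID -> frequency, as read from clonotypes.csv.
    """
    return {
        clone_id: frequencies[clone_id]
        for clone_id, chains in clone_chains.items()
        if is_single_pair(chains)
    }


clone_chains = {
    "clonotype1": ["IGH", "IGK"],          # kept
    "clonotype2": ["IGH"],                 # single chain, discarded
    "clonotype3": ["IGH", "IGK", "IGL"],   # multiple light chains, discarded
    "clonotype4": ["TRA", "TRB"],          # kept
}
frequencies = {"clonotype1": 5, "clonotype2": 3, "clonotype3": 2, "clonotype4": 1}
print(select_clones(clone_chains, frequencies))
# prints {'clonotype1': 5, 'clonotype4': 1}
```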