RiboKit : HiTRACE

High-Throughput Robust Analysis for Capillary Electrophoresis

Step 6: Data Corrections (for 1D data only)


For Mutate-and-Map data, we can draw inferences from the changes in the chemical mapping reactivities as mutations are carried out. Thus the relative values are OK. However, for quantitative analysis, we need to correct for several factors arise from the experiment measurements:

In principle, this could all be carried out accurately and analytically if we could measure the amount of unmodified product – the very longest (full-length extended )band in the capillary trace. But under ‘single hit conditions’, this fully extended band typically saturates our fluorescence detector. And this leads to the next item:


HiTRACE offers a likelihood-based framework for estimating the scaling and attenuation correction factors.

An obsolete script overmod_and_background_correct_logL() was used to optimize the attenuation coefficient (by a grid search) and the scaling for the background (by linear programming) so as to best match your expectation. Please use get_reactivities() instead!

We first need to define some parameters:

ref_segment = 'GAGUA';
ref_peak = get_ref_peak(sequence, ref_segment, offset);

It searches for GAGUA sequence (which is the pentaloop of both reference hairpins at 5′ and 3′) and records the indices of those residues into ref_peak.

Next, we define the standard deviation to use as a cutoff for saturation in unsaturate() step. We basically always use:

sd_cutoff = 1.5;

Now we need to split the dataset into a diluted half and a saturated (undiluted) half. We exclude any ddNTP lanes since we are not interested in quantitating them. For the DMS, CMCT, and SHAPE profiles, recall from the layout here. We know that:

saturated_idx = [9, 10, 13, 14, 17 ,18 ,21, 22, 25, 26, 29, 30, 33, 34, 37, 38, 41, 42, 45, 46, 49, 50, 53, 54];
diluted_idx = saturated_idx + 2;

Not to be confused with reorder from Step #1. The reordering of CE traces into our favorite ordering was done by quick_look(), and only needed once. From that moment on, the data in d_align (hence area_peak) is in the correct order. The indices in saturated_idx and diluted_idx should refer to area_peak locally.

For the nomod lanes, we need to average them before running get_reactivities(). Since conditions of absence and presence of ligand should not be mixed together, we do:

saturated_array_nomod = [mean(area_peak(:, 1:2), 2),  mean(area_peak(:, 5:6), 2)];
diluted_array_nomod = [mean(area_peak(:, 3:4), 2),  mean(area_peak(:, 7:8), 2)];

This is only needed for this example since it has replicates for each condition (e.g. 2 nomod lanes). Since we cannot subtract data to multiple background lanes, we are averaging them into 1 for each condition.

The above command averages lane 1 with 2 (without ligand), 5 with 6 (with ligand) for saturated nomod data; lane 3 with 4 (without ligand), 7 with 8 (with ligand) for diluted nomod data. Note the use of mean() by row. Now let’s merge the nomod data with all the other modifiers; same for errors:

saturated_array = [saturated_array_nomod, area_peak(:, saturated_idx)];
diluted_array = [diluted_array_nomod, area_peak(:, diluted_idx)];

saturated_error = [ ...
    mean(darea_peak(:, 1:2), 2), ...
    mean(darea_peak(:, 5:6), 2),...
    darea_peak(:, saturated_idx), ...
];
diluted_error = [ ...
    mean(darea_peak(:, 3:4), 2), ...
    mean(darea_peak(:, 7:8), 2),...
    darea_peak(:, diluted_idx), ...
];

The saturated_array should match saturated_error in column orders; same for diluted_array and diluted_error.

You can bypass the saturation correction if you did not perform dilution in your data. To do so, first still define your saturated_array as above. And then use diluted_array = saturated_array;. In this way, it won’t find anything to correct when comparing diluted_array to saturated_array.

Lastly, we need to specify the background columns for background subtraction. In our saturated_array has the columns as:

All the (-) columns should be subtracted by nomod (-) while all the (+) columns should be subtracted by nomod (+). The column indices of nomod (-) and nomod (+) in saturated_array are 1 and 2, thus:

bkg_col  = [1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2];

The numberings in bkg_col are based on column indices of saturated_array and diluted_array; not to be confused with saturated_idx and diluted_idx that are based on area_peak.

Finally, we can kick off the quantitation with:

[normalized_reactivity, normalized_error, seqpos_out] = get_reactivities( ...
    saturated_array, diluted_array, saturated_error, diluted_error, ...
    bkg_col, ref_peak, seqpos, [], data_types([1, 2, saturated_idx]), sequence, offset, sd_cutoff);

The argument data_types(saturated_idx) provides information of modifier identity for the lanes in saturated_array and diluted_array. Alternatively, data_types([1, 2, saturated_idx]) could be written as [{'nomod'}, {'nomod'}, data_types(saturated_idx)]. It’s important that it matches with the lanes identity in saturated_array and diluted_array, since area_peak is not used in this function call!

Recall the numbers of nucleotides reactive in GAGUA to DMS [2], CMCT [1], and SHAPE [5] are different. This information helps normalization step to scale up the profiles properly (by averaging among the right number of reactive nucleotides within GAGUA).

To better clarify the confusing indices and rearrangments, please read this illustration.


The run goes through the 4 processes, and waits you to continue between steps:

get_reactivities Figure Unsaturation

The saturated sample is on the left, diluted sample (after scaling up to match saturated sample) is in the middle, and the result after unsaturate() is on the right. The red lines show where a plateaued reactivity is found in saturated sample (by comparing to scaled diluted sample).

get_reactivities Figure Attenuation Correction

The data is now plotted by seqpos: 5′ on the top. The attenuation (lighter from bottom to top) is now corrected.

After this step, the extra full-length extened band is removed.

get_reactivities Figure Background Subtraction

After subtracting the background, the data now looks cleaner. Of course the _nomod_lanes are completely blank now.

get_reactivities Figure Normalization

Reactive residues in GAGUA reference loops are scaled to reactivity of 1.0 on average.

If no ref_peak is given, the normalized reactivites may have not been scaled correctly.

Once finished, the get_reactivities() command returns the following variables:

Variable Type Description
normalized_reactivity SxP double Final reactivity profiles. P is the number of columns in saturated_array. Note that the number of rows is now S instead of S+1.
normalized_error SxP double Error estimates for normalized_reactivity that are propagated.
seqpos_out 1xS int Numbers of sequence positions from 5′ to 3′ end. Note that is has the full-length extended band position removed compared to seqpos.

Again, you do not need to process your 2D dataset using this step!

Built with Jekyll using a RiboKit Theme . Hosted on GitHub Pages.