High-Throughput Robust Analysis for Capillary Electrophoresis
For Mutate-and-Map data, we can draw inferences from the changes in the chemical mapping reactivities as mutations are carried out. Thus the relative values are OK. However, for quantitative analysis, we need to correct for several factors arise from the experiment measurements:
Background Stops: the reverse transcriptase has intrinsic stops along the sequence. You may see such sites in a nomod profile. We need to subtract out these to estimate true chemical reactivities.
Signal Attenuation: the reverse transcriptase terminates (is blocked) at the first modification site it encounters from the 3′ end. Thus, the number of cDNA fragments attenuates as the length gets longer. For a long RNA, it is easy to spot that the longer reverse transcription products will have attenuated intensity compared to shorter ones. We need to correct for such directional bias.
In principle, this could all be carried out accurately and analytically if we could measure the amount of unmodified product – the very longest (full-length extended )band in the capillary trace. But under ‘single hit conditions’, this fully extended band typically saturates our fluorescence detector. And this leads to the next item:
Signal Saturation: the fluorescence camera of the CE machine has a certain detection range. For signals that are beyond the maximum value, they cannot be measured accurately and thus appear plateaued. This is common for the full-length extended band of the gel since the majority of input RNA are unmodified under single-hit kinetics. We dilute the same sample in order to retrieve the intensity of such peaks.
Normalization: to scale up the final reactivity profile to a common standard so that reactivities from different experiment (e.g. different RNA) can be compared to each other. This is where we the GAGUA reference hairpin comes to help.
HiTRACE offers a likelihood-based framework for estimating the scaling and attenuation correction factors.
An obsolete script
overmod_and_background_correct_logL()
was used to optimize the attenuation coefficient (by a grid search) and the scaling for the background (by linear programming) so as to best match your expectation. Please useget_reactivities()
instead!
We first need to define some parameters:
ref_segment = 'GAGUA';
ref_peak = get_ref_peak(sequence, ref_segment, offset);
It searches for GAGUA sequence (which is the pentaloop of both reference hairpins at 5′ and 3′) and records the indices of those residues into ref_peak
.
Next, we define the standard deviation to use as a cutoff for saturation in unsaturate()
step. We basically always use:
sd_cutoff = 1.5;
Now we need to split the dataset into a diluted half and a saturated (undiluted) half. We exclude any ddNTP lanes since we are not interested in quantitating them. For the DMS, CMCT, and SHAPE profiles, recall from the layout here. We know that:
saturated_idx = [9, 10, 13, 14, 17 ,18 ,21, 22, 25, 26, 29, 30, 33, 34, 37, 38, 41, 42, 45, 46, 49, 50, 53, 54];
diluted_idx = saturated_idx + 2;
Not to be confused with
reorder
from Step #1. The reordering of CE traces into our favorite ordering was done byquick_look()
, and only needed once. From that moment on, the data ind_align
(hencearea_peak
) is in the correct order. The indices insaturated_idx
anddiluted_idx
should refer toarea_peak
locally.
For the nomod lanes, we need to average them before running get_reactivities()
. Since conditions of absence and presence of ligand should not be mixed together, we do:
saturated_array_nomod = [mean(area_peak(:, 1:2), 2), mean(area_peak(:, 5:6), 2)];
diluted_array_nomod = [mean(area_peak(:, 3:4), 2), mean(area_peak(:, 7:8), 2)];
This is only needed for this example since it has replicates for each condition (e.g. 2 nomod lanes). Since we cannot subtract data to multiple background lanes, we are averaging them into 1 for each condition.
The above command averages lane 1 with 2 (without ligand), 5 with 6 (with ligand) for saturated nomod data; lane 3 with 4 (without ligand), 7 with 8 (with ligand) for diluted nomod data. Note the use of mean()
by row. Now let’s merge the nomod data with all the other modifiers; same for errors:
saturated_array = [saturated_array_nomod, area_peak(:, saturated_idx)];
diluted_array = [diluted_array_nomod, area_peak(:, diluted_idx)];
saturated_error = [ ...
mean(darea_peak(:, 1:2), 2), ...
mean(darea_peak(:, 5:6), 2),...
darea_peak(:, saturated_idx), ...
];
diluted_error = [ ...
mean(darea_peak(:, 3:4), 2), ...
mean(darea_peak(:, 7:8), 2),...
darea_peak(:, diluted_idx), ...
];
The
saturated_array
should matchsaturated_error
in column orders; same fordiluted_array
anddiluted_error
.
You can bypass the saturation correction if you did not perform dilution in your data. To do so, first still define your
saturated_array
as above. And then usediluted_array = saturated_array;
. In this way, it won’t find anything to correct when comparingdiluted_array
tosaturated_array
.
Lastly, we need to specify the background columns for background subtraction. In our saturated_array
has the columns as:
nomod (-), nomod (+), DMS (-) x4, DMS (+) x4, CMCT (-) x4, CMCT (+) x4, SHAPE (-) x4, SHAPE (+) x4.
[(-) and (+) mark ligand condition.]
All the (-) columns should be subtracted by nomod (-) while all the (+) columns should be subtracted by nomod (+). The column indices of nomod (-) and nomod (+) in saturated_array
are 1 and 2, thus:
bkg_col = [1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2];
The numberings in
bkg_col
are based on column indices ofsaturated_array
anddiluted_array
; not to be confused withsaturated_idx
anddiluted_idx
that are based onarea_peak
.
Finally, we can kick off the quantitation with:
[normalized_reactivity, normalized_error, seqpos_out] = get_reactivities( ...
saturated_array, diluted_array, saturated_error, diluted_error, ...
bkg_col, ref_peak, seqpos, [], data_types([1, 2, saturated_idx]), sequence, offset, sd_cutoff);
The argument
data_types(saturated_idx)
provides information of modifier identity for the lanes insaturated_array
anddiluted_array
. Alternatively,data_types([1, 2, saturated_idx])
could be written as[{'nomod'}, {'nomod'}, data_types(saturated_idx)]
. It’s important that it matches with the lanes identity insaturated_array
anddiluted_array
, sincearea_peak
is not used in this function call!
Recall the numbers of nucleotides reactive in GAGUA to DMS [2], CMCT [1], and SHAPE [5] are different. This information helps normalization step to scale up the profiles properly (by averaging among the right number of reactive nucleotides within GAGUA).
To better clarify the confusing indices and rearrangments, please read this illustration.
The run goes through the 4 processes, and waits you to continue between steps:
The saturated sample is on the left, diluted sample (after scaling up to match saturated sample) is in the middle, and the result after unsaturate()
is on the right. The red lines show where a plateaued reactivity is found in saturated sample (by comparing to scaled diluted sample).
The data is now plotted by seqpos
: 5′ on the top. The attenuation (lighter from bottom to top) is now corrected.
After this step, the extra full-length extened band is removed.
After subtracting the background, the data now looks cleaner. Of course the _nomod_lanes are completely blank now.
Reactive residues in GAGUA reference loops are scaled to reactivity of 1.0 on average.
If no
ref_peak
is given, the normalized reactivites may have not been scaled correctly.
Once finished, the get_reactivities()
command returns the following variables:
Variable | Type | Description |
---|---|---|
normalized_reactivity |
SxP double | Final reactivity profiles. P is the number of columns in saturated_array . Note that the number of rows is now S instead of S+1. |
normalized_error |
SxP double | Error estimates for normalized_reactivity that are propagated. |
seqpos_out |
1xS int | Numbers of sequence positions from 5′ to 3′ end. Note that is has the full-length extended band position removed compared to seqpos . |
Again, you do not need to process your 2D dataset using this step!
Built with Jekyll using a RiboKit Theme . Hosted on GitHub Pages.