Title: | Analysis and Visualization Tools for Genetic Barcode Data |
---|---|
Description: | Provides the necessary functions to identify and extract a selection of already available barcode constructs (Cornils, K. et al. (2014) <doi:10.1093/nar/gku081>) and freely choosable barcode designs from next generation sequence (NGS) data. Furthermore, it offers the possibility to account for sequence errors, the calculation of barcode similarities and provides a variety of visualisation tools (Thielecke, L. et al. (2017) <doi:10.1038/srep43249>). |
Authors: | Lars Thielecke [aut, cre] |
Maintainer: | Lars Thielecke <[email protected]> |
License: | LGPL |
Version: | 1.2.8 |
Built: | 2025-03-10 16:13:35 UTC |
Source: | https://github.com/cran/genBaRcode |
Creates a search file for a command line grep search.
.createPatternFile(bc_backbone, patterns_file)
bc_backbone |
a character string (barcode pattern). |
patterns_file |
a character string (file name) |
Generates a collection of colors for a list of barcodes based on their identified minimum hamming distances.
.generateColors(minHD, type = "rainbow", alpha = 1)
minHD |
a numeric vector of all the minimum hamming distances. |
type |
a character string. Possible Values are "rainbow", "heat.colors", "topo.colors", "greens", "wild". |
alpha |
a numeric value between 0 and 1, modifies colour transparency. |
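The mapping from distances to colours can be sketched in base R. This is only an illustration of the idea, not the package's internal code; the helper name `colors_by_distance` is hypothetical and only the "rainbow" palette is shown.

```r
# Hypothetical helper: one palette entry per distinct minimum distance.
colors_by_distance <- function(minHD, alpha = 1) {
  lv <- sort(unique(minHD))                          # distinct distance values
  pal <- grDevices::rainbow(length(lv), alpha = alpha)
  pal[match(minHD, lv)]                              # colour for every barcode
}

min_hd <- c(0, 2, 1, 0, 2)
cols <- colors_by_distance(min_hd, alpha = 0.5)
print(cols)
```

Barcodes with the same minimum distance receive the same colour, so the palette size depends on the number of distinct distances rather than on the number of barcodes.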
Identifies the barcode positions within the barcode backbone and generates an awk command.
.getBarcodeFilter(wobble_pos)
wobble_pos |
a character string. |
Generates a matrix index to create a square triangular matrix.
.getDiagonalIndex(n)
n |
an integer indicating the size of the resulting index matrix. |
a logical matrix of size n x n.
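A logical index of this kind can be produced with base R's `lower.tri`. Whether the package selects the lower or the upper triangle is an assumption here; the sketch only shows how such an index picks each pair of a symmetric matrix once.

```r
n <- 4
# TRUE below the diagonal, FALSE elsewhere (assumed orientation)
idx <- lower.tri(matrix(NA, n, n), diag = FALSE)

# Example: pull the unique pairwise entries out of a symmetric matrix
d <- matrix(seq_len(n * n), n, n)
pairwise <- d[idx]
print(idx)
print(pairwise)
```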
Calculates the minimum distance to a set of predefined barcodes for a given list of barcodes.
.getMinDist(BC_dat, ori_BCs, m = "hamming")
BC_dat |
a BCdat object |
ori_BCs |
a character vector containing barcodes to which the minimal hamming distance will be calculated. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
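For the default Hamming method, the calculation can be sketched in base R as follows. The helpers `hamming` and `min_dist` are hypothetical names; the package itself relies on the stringdist package, which this sketch does not use.

```r
# Hamming distance between two equally long strings
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# For every observed barcode, the smallest distance to any known barcode
min_dist <- function(barcodes, ori_BCs) {
  vapply(barcodes, function(bc) {
    min(vapply(ori_BCs, function(ref) hamming(bc, ref), integer(1)))
  }, integer(1), USE.NAMES = FALSE)
}

observed <- c("ACGT", "ACGA", "TTTT")
known <- c("ACGT", "TTTA")
res <- min_dist(observed, known)
print(res)
```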
Extracts barcode positions.
.getWobblePos(bc_backbone = "")
bc_backbone |
a character vector. |
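Since variable positions in a backbone are marked with the letter 'N', locating them amounts to finding every 'N' in the string. A minimal base R sketch, using a shortened, made-up backbone:

```r
bc_backbone <- "ACTNNCGANN"   # hypothetical short backbone for illustration

# Indices of all variable ("wobble") positions
wobble_pos <- which(strsplit(bc_backbone, "")[[1]] == "N")
print(wobble_pos)
```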
Converts hex colors into gephi usable rgb colors
.hex2rgbColor(colrs)
colrs |
a character vector containing a list of hex colors |
a color vector.
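The conversion can be sketched with `grDevices::col2rgb`. The comma-separated "R,G,B" output format is an assumption about what Gephi expects in a GDF colour column, not a statement about the package's exact output.

```r
hex_cols <- c("#FF0000", "#00FF80")

# 3 x n integer matrix with rows red, green and blue
rgb_mat <- grDevices::col2rgb(hex_cols)

# One "R,G,B" string per colour (assumed target format)
rgb_strings <- unname(apply(rgb_mat, 2, paste, collapse = ","))
print(rgb_strings)
```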
Converts a vector of character strings (DNA sequences) into its reverse complement.
.revComp(seq_dat)
seq_dat |
a character vector containing DNA sequences |
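A reverse complement can be computed in base R by complementing each base and then reversing the string. The function name `rev_comp` is hypothetical and the sketch is not the package's implementation.

```r
rev_comp <- function(seq_dat) {
  comp <- chartr("ACGTN", "TGCAN", seq_dat)          # complement every base
  vapply(strsplit(comp, ""),
         function(x) paste(rev(x), collapse = ""),   # reverse each sequence
         character(1))
}

rc <- rev_comp(c("ACGT", "AACN"))
print(rc)
```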
Converts a vector of equally long character strings into its reverse complement.
.revComp_EqLength(seq_dat, word_length)
seq_dat |
a character vector. |
word_length |
an integer giving the word length. |
Converts a vector of unequally long character strings into the reverse complement.
.revComp_UneqLength(seq_dat)
seq_dat |
A character vector. |
Checks directory paths for correctness and, if necessary, corrects them.
.testDirIdentifier(s)
s |
a character string. |
Converts a data.frame into a BCdat object.
asBCdat(dat, label = "empty", BC_backbone = "none", resDir = getwd())
dat |
a data.frame object with two columns containing read counts and barcode sequences. |
label |
an optional character string used as label. |
BC_backbone |
an optional character string, describing the barcode backbone structure. |
resDir |
an optional character string, identifying the path to the results directory, default is the current working directory. |
a BCdat object.
A dataset containing an example BCdat object which consists of 98 barcode sequences, without any error correction applied yet.
BC_dat
An S4 data object with the following slots:
sequence overview
a data frame consisting of read counts and barcode sequences
path to a directory for any kind of results
a string clarifying the barcode backbone structure
character string, used as label for file names etc.
BC_dat:
A dataset containing an example BCdat object after error-correction which consists of 10 barcode sequences.
BC_dat_EC
An S4 data object with the following slots:
sequence overview
a data frame consisting of read counts and barcode sequences
path to a directory for any kind of results
a string clarifying the barcode backbone structure
character string, used as label for file names etc.
BC_dat_EC:
BCdat class.
reads
data.frame containing barcode sequences and their corresponding read counts.
results_dir
character string of the working directory path.
label
character string identifying the particular experiment (will be part of the names of any file created).
BC_backbone
character string of the used barcode design (also called barcode backbone).
Comparing two BCdat objects.
com_pair(BC_dat1 = NULL, BC_dat2 = NULL)
BC_dat1 |
the first BCdat object. |
BC_dat2 |
the second BCdat object. |
a list containing the shared and the unique barcodes.
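Conceptually, the comparison is a set operation on the barcode sequences of both objects. A base R sketch on plain character vectors; the list element names below are hypothetical and need not match what com_pair returns:

```r
bc1 <- c("AAAA", "ACGT", "TTTT")   # barcodes of the first data set
bc2 <- c("ACGT", "GGGG")           # barcodes of the second data set

comparison <- list(
  shared   = intersect(bc1, bc2),  # present in both
  unique_1 = setdiff(bc1, bc2),    # only in the first
  unique_2 = setdiff(bc2, bc1)     # only in the second
)
str(comparison)
```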
createGDF creates a data file usable with the free graph visualisation tool Gephi. The nodes represent barcodes and their respective sizes reflect the corresponding read counts. Edges between nodes indicate a distance between two barcodes of at most minDist. If ori_BCs is provided, the node color reflects the distance of a particular barcode to one of the provided barcode sequences.
createGDF( BC_dat, minDist = 1, loga = TRUE, ori_BCs = NULL, col_type = "rainbow", m = "hamming" )
BC_dat |
a BCdat object. |
minDist |
an integer value representing the maximal distance value for which the graph will contain edges. |
loga |
a logical value indicating the use or non-use of logarithmic read count values. |
ori_BCs |
a vector of character strings containing the barcode sequences (without the fixed positions of the barcode construct). |
col_type |
a character string, choosing one of the available color palettes. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
## Not run:
data(BC_dat)
createGDF(BC_dat, minDist = 1, loga = TRUE, ori_BCs = NULL, col_type = "rainbow")
## End(Not run)
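The edge rule described above (connect two barcodes whose distance is at most minDist) can be sketched in base R. The helper `hamming` and the column names of `edges` are hypothetical, and the sketch does not write a GDF file.

```r
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

barcodes <- c("AAAA", "AAAT", "TTTT")
minDist <- 1

# All unordered barcode pairs, one pair per row
pairs <- t(utils::combn(seq_along(barcodes), 2))
d <- apply(pairs, 1, function(p) hamming(barcodes[p[1]], barcodes[p[2]]))

# Keep only pairs that are close enough to be connected
edges <- data.frame(from = barcodes[pairs[, 1]],
                    to = barcodes[pairs[, 2]],
                    dist = d,
                    stringsAsFactors = FALSE)[d <= minDist, ]
print(edges)
```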
Creates a circle plot based on the additional data gathered by the errorCorrection function (EC_analysis needs to be set to TRUE). This function is intended to visualize the error correction procedure.
error_correction_circlePlot(edges, vertices)
edges |
a data frame containing edge definitions in two columns called "from" and "to". Such a data frame is returned by the errorCorrection function with the EC_analysis parameter set to TRUE. |
vertices |
a data frame with at least one column containing a list of nodes (also returned by the errorCorrection function with the EC_analysis parameter set to TRUE). |
a ggplot2 object.
This function will create a jitter plot displaying the maximal distances within each of the barcode sequence clusters.
error_correction_clustered_HDs(datEC, size = 0.75)
datEC |
a BCdat object, returned by the errorCorrection function with the EC_analysis parameter set to TRUE. |
size |
a numeric value, specifying the dot size. |
a ggplot2 object.
Creates a tree plot visualising the barcode clustering as part of the error correction process.
error_correction_treePlot(edges, vertices)
edges |
a data frame containing edge definitions in two columns called "from" and "to". Such a data frame is returned by the errorCorrection function with the EC_analysis parameter set to TRUE. |
vertices |
a data frame with at least one column containing a list of nodes (also returned by the errorCorrection function with the EC_analysis parameter set to TRUE). |
a ggplot2 object.
Corrects a list of equally long (barcode) sequences. Based on calculated hamming distances as a measure of similarity, highly similar sequences are clustered together and the cluster label will be the respective sequence with the highest read count.
errorCorrection( BC_dat, maxDist, save_it = FALSE, cpus = 1, strategy = "sequential", m = "hamming", type = "standard", only_EC_BCs = TRUE, EC_analysis = FALSE, start_small = TRUE )
BC_dat |
one or a list of BCdat objects, containing the necessary sequences. |
maxDist |
an integer value representing the maximal hamming distance for which it is allowed to cluster two sequences together. |
save_it |
a logical value. If TRUE the data will be saved as csv-file. |
cpus |
an integer value, in case multiple BCdat objects are provided a CPU number greater than one would allow for a parallelized calculation (one CPU per BCdat object). |
strategy |
since the future package is used for parallelisation a strategy has to be stated, the default is "sequential" (cpus = 1) and "multiprocess" (cpus > 1). It is not necessary to choose a certain strategy, since it will be adjusted according to the number of cpus chosen. For further information please read the future::plan() R documentation. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information) |
type |
there are different error correction strategies available ("standard", "connectivity based", "graph based", "clustering"). |
only_EC_BCs |
a logical value. If TRUE, only information about barcodes which are still present after error correction will be saved. Only meaningful if EC_analysis is set to TRUE. |
EC_analysis |
a logical value. If TRUE additional error correction details will be returned, which can also be visualised with the respective "error correction" plots. |
start_small |
a logical value. If TRUE, the error correction type "standard" will always cluster the smallest highly similar BC with the BC of interest. If FALSE, the error correction type "standard" will adapt its cluster strategy and always cluster the BC of interest with the most frequent highly similar BC. |
data(BC_dat)
BC_dat_EC <- errorCorrection(BC_dat, maxDist = 8, save_it = FALSE, m = "hamming")
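The general idea of merging rare sequences into similar, more abundant ones can be sketched in base R. This is a simplified greedy illustration under stated assumptions, not the algorithm behind any of the package's `type` options; the function `correct_errors` and the column names `read_count` and `barcode` are hypothetical.

```r
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

correct_errors <- function(reads, maxDist = 1) {
  reads <- reads[order(reads$read_count), ]          # rarest barcode first
  i <- 1
  while (i < nrow(reads)) {
    others <- seq(i + 1, nrow(reads))
    d <- vapply(others,
                function(j) hamming(reads$barcode[i], reads$barcode[j]),
                integer(1))
    hit <- others[d <= maxDist]
    if (length(hit) > 0) {
      # merge into the most frequent similar barcode, then drop the rare one
      target <- hit[which.max(reads$read_count[hit])]
      reads$read_count[target] <- reads$read_count[target] + reads$read_count[i]
      reads <- reads[-i, ]
    } else {
      i <- i + 1                                     # nothing similar, keep it
    }
  }
  reads[order(-reads$read_count), ]
}

reads <- data.frame(read_count = c(100, 3, 50, 2, 7),
                    barcode = c("AAAA", "AAAT", "CCCC", "CCCG", "GGTT"),
                    stringsAsFactors = FALSE)
res <- correct_errors(reads, maxDist = 1)
print(res)
```

In this toy input the two single-mismatch variants are absorbed by their abundant neighbours, while "GGTT" has no close neighbour and survives unchanged.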
Extracts barcodes according to the given barcode design from a fastq file.
extractBarcodes( dat, label, results_dir = "./", mismatch = 0, indels = FALSE, bc_backbone, full_output = FALSE, cpus = 1, strategy = "sequential", wobble_extraction = TRUE, dist_measure = "hamming" )
dat |
a ShortReadQ object. |
label |
a character string. |
results_dir |
a character string which contains the path to the results directory. |
mismatch |
a non-negative integer value, default is 0; greater values indicate the number of allowed mismatches when identifying the barcode construct. |
indels |
under construction. |
bc_backbone |
a character string or character vector describing the barcode design, variable positions have to be marked with the letter 'N'. |
full_output |
a logical value. If TRUE additional output files will be generated in order to identify errors. |
cpus |
an integer value, indicating the number of available cpus. |
strategy |
since the future package is used for parallelisation a strategy has to be stated, the default is "sequential" (cpus = 1) and "multiprocess" (cpus > 1). For further information please read future::plan() R-Documentation. |
wobble_extraction |
a logical value. If TRUE, single reads will be stripped of the backbone and only the "wobble" positions will be left. |
dist_measure |
a character value. If bc_backbone = "none", single reads will be clustered based on a distance measure. Available distance methods are Optimal string alignment ("osa"), Levenshtein ("lv"), Damerau-Levenshtein ("dl"), Hamming ("hamming"), Longest common substring ("lcs"), q-gram ("qgram"), cosine ("cosine"), Jaccard ("jaccard"), Jaro-Winkler ("jw"), distance based on soundex encoding ("soundex"). See the stringdist function of the stringdist-package for more information. |
one or a list of frequency table(s) of barcode sequences.
## Not run:
bc_backbone <- "ACTNNCGANNCTTNNCGANNCTTNNGGANNCTANNACTNNCGANNCTTNNCGANNCTTNNGGANNCTANNACTNNCGANN"
source_dir <- system.file("extdata", package = "genBaRcode")
dat <- ShortRead::readFastq(dirPath = source_dir, pattern = "test_data.fastq.gz")
extractBarcodes(dat, label = "test", results_dir = getwd(), mismatch = 0, indels = FALSE, bc_backbone)
## End(Not run)
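For the exact-match case (mismatch = 0), the extraction principle can be sketched in base R with a regular expression built from the backbone. The reads and the short backbone below are made up, and the package's actual search strategy may differ.

```r
bc_backbone <- "ACTNNCGANN"                        # hypothetical short backbone
reads <- c("TTACTGACGATCAA", "ACTAACGAGG", "GGGGGGGGGG")

# Every 'N' may be any nucleotide, fixed positions must match exactly
pattern <- gsub("N", "[ACGT]", bc_backbone, fixed = TRUE)
hits <- regmatches(reads, regexpr(pattern, reads)) # reads without a match drop out

# Strip the fixed backbone and keep only the variable positions
wobble_pos <- which(strsplit(bc_backbone, "")[[1]] == "N")
barcodes <- vapply(strsplit(hits, ""),
                   function(x) paste(x[wobble_pos], collapse = ""),
                   character(1))

# Frequency table of the extracted barcodes
freq <- as.data.frame(table(barcodes), stringsAsFactors = FALSE)
print(freq)
```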
Launches the corresponding shiny app.
genBaRcode_app(dat_dir = system.file("extdata", package = "genBaRcode"))
dat_dir |
a character string, identifying the path to one or more fast(q) files which shall be analysed, default is the path to the package inherent example fastq file |
Generates a barplot based on read counts. If ori_BCs is provided, the bar color reflects the distance between a particular barcode and one of the provided barcode sequences.
generateKirchenplot( BC_dat, ori_BCs = NULL, ori_BCs2 = NULL, loga = TRUE, col_type = NULL, m = "hamming", setLabels = c("BC-Set 1", "Rest", "BC-Set 2") )
BC_dat |
a BCdat object. |
ori_BCs |
a vector of character strings containing known barcode sequences (without the fixed positions of the barcode construct). |
ori_BCs2 |
a vector of character strings containing a 2nd set of known barcode sequences (also without the fixed positions). |
loga |
a logical value, indicating the use or non-use of logarithmic read count values. |
col_type |
a character string, choosing one of the available color palettes ("rainbow", "heat.colors", "topo.colors", "greens", "wild" - see package "grDevices"). |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). If neither 'ori_BCs' nor 'ori_BCs2' are provided with input the choice of 'm' does not matter. |
setLabels |
a character vector, containing three strings serving as plot labels. |
a ggplot2 object
Generates a matrix containing barcode sequences as rows and consecutive measurements as columns. It serves as the necessary data object for the plotting function 'plotTimeSeries'.
generateTimeSeriesData(BC_dat_list)
BC_dat_list |
a list of BCdat objects. |
a data.frame containing every identified barcode and its read count per time point/measurement.
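The layout of the result (barcodes as rows, measurements as columns) can be illustrated in base R with plain data frames standing in for the read tables of the BCdat objects. The column names `read_count` and `barcode` and the time point labels are hypothetical.

```r
tp1 <- data.frame(read_count = c(10, 5), barcode = c("AAAA", "CCCC"),
                  stringsAsFactors = FALSE)
tp2 <- data.frame(read_count = c(7, 2), barcode = c("CCCC", "GGGG"),
                  stringsAsFactors = FALSE)
tables <- list(t1 = tp1, t2 = tp2)

# Union of all barcodes seen at any time point
all_bcs <- sort(unique(unlist(lapply(tables, function(x) x$barcode))))

# One column per time point; barcodes missing at a time point count as 0
ts_dat <- sapply(tables, function(x) {
  cnt <- x$read_count[match(all_bcs, x$barcode)]
  cnt[is.na(cnt)] <- 0
  cnt
})
rownames(ts_dat) <- all_bcs
print(ts_dat)
```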
Accesses the barcode backbone slot of a BCdat object.
getBackbone(object)
object |
a BCdat object. |
A character string.
data(BC_dat)
getBackbone(BC_dat)
Allows the user to choose between predefined backbone sequences. Execution of the function without any parameter value will display all available backbone sequences. The id parameter accepts the name of the backbone or the row number of the shown selection.
getBackboneSelection(id = NULL)
id |
an integer or character value in order to choose a specific backbone. |
a character string.
getBackboneSelection()
getBackboneSelection(2)
getBackboneSelection("BC32-Venus")
Accesses the label slot of a BCdat object.
getLabel(object)
object |
a BCdat object. |
A character string.
data(BC_dat)
getLabel(BC_dat)
Accesses the read count slot of a BCdat object.
getReads(object)
object |
a BCdat object. |
A data.frame containing the read count table of the object parameter.
data(BC_dat)
getReads(BC_dat)
Accesses the results directory slot of a BCdat object.
getResultsDir(object)
object |
a BCdat object. |
A character string.
data(BC_dat)
getResultsDir(BC_dat)
ggplotDistanceGraph will create a graph-like visualisation (ripple plot) of the corresponding barcode sequences and their similarity based on the ggplot2 and the ggnetwork packages. The nodes represent the barcode sequences and their respective sizes reflect the corresponding read counts. Edges between nodes indicate a distance between two barcodes of at most minDist. If ori_BCs is provided, the node color also reflects the distance of a particular barcode to one of the initial barcodes.
ggplotDistanceGraph( BC_dat, minDist = 1, loga = TRUE, ori_BCs = NULL, lay = "fruchtermanreingold", complete = FALSE, col_type = "rainbow", m = "hamming", scale_nodes = 1, scale_edges = 1, legend_size = 4 )
BC_dat |
a BCdat object. |
minDist |
an integer value representing the maximal distance for which the graph will contain edges. |
loga |
a logical value, indicating the use or non-use of logarithmic read count values. |
ori_BCs |
a vector of character strings containing the barcode sequences (without the fixed positions of the barcode construct). |
lay |
a character string, identifying the preferred layout algorithm (see ggnetwork layout option, "?gplot.layout"). Default value is "fruchtermanreingold", but "circle", "eigen", "kamadakawai", "spring" and many more are also possible. Alternatively, the user can provide a two-column matrix with as many rows as there are nodes in the network, in which case the matrix is used as node coordinates. |
complete |
a logical value. If TRUE, every node will have at least one edge. |
col_type |
a character string, choosing one of the available color palettes ("rainbow", "heat.colors", "topo.colors", "greens", "wild" - see package "grDevices"). |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
scale_nodes |
a numeric value, scaling the node size. |
scale_edges |
a numeric value, scaling the edge size. |
legend_size |
a numeric value, scaling the legend symbol size, if legend_size equals 0, the legend will be dismissed. |
a ggplot2 object
## Not run:
data(BC_dat)
ggplotDistanceGraph(BC_dat, minDist = 1, loga = TRUE, ori_BCs = NULL, lay = "fruchtermanreingold", complete = FALSE, col_type = "rainbow")
## End(Not run)
ggplotDistanceGraph_EC will create a graph-like visualisation (ripple plot) of the corresponding barcode sequences and their similarity based on the ggplot2 and the ggnetwork packages. The nodes represent the barcode sequences and their respective sizes reflect the corresponding read counts. Edges between nodes indicate a distance between two barcodes of at most minDist. If ori_BCs is provided, the node color also reflects the distance of a particular barcode to one of the initial barcodes.
ggplotDistanceGraph_EC( BC_dat, BC_dat_EC, minDist = 1, loga = TRUE, equal_node_sizes = TRUE, BC_threshold = NULL, ori_BCs = NULL, lay = "fruchtermanreingold", complete = FALSE, col_type = "rainbow", m = "hamming", scale_nodes = 1, scale_edges = 1 )
BC_dat |
a BCdat object. |
BC_dat_EC |
the error corrected BCdat object (the EC_analysis parameter needs to be set to TRUE). |
minDist |
an integer value representing the maximal distance for which the graph will contain edges. |
loga |
a logical value, indicating the use or non-use of logarithmic read count values. |
equal_node_sizes |
a logical value. If TRUE, every node will have the same size. |
BC_threshold |
a numeric value, limiting the number of barcodes for which the error correction "history" will be colored (if BC_threshold = 5, the five biggest barcodes will be evaluated). |
ori_BCs |
a vector of character strings containing barcode sequences (without the fixed positions of the barcode construct). Similar to BC_threshold but allowing for barcode identification via sequence. |
lay |
a character string, identifying the preferred layout algorithm (see ggnetwork layout option). |
complete |
a logical value. If TRUE, every node will have at least one edge. |
col_type |
a character string, choosing one of the available color palettes. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
scale_nodes |
a numeric value, scaling the node size. |
scale_edges |
a numeric value, scaling the edge size. |
a ggplot2 object
Experimental function to identify hybrid barcodes which can occur due to unfinished synthesis of a template in-between PCR cycles.
hybridsIdentification(dat, min_seq_length = 10)
dat |
a character vector containing barcode sequences or a BCdat object. |
min_seq_length |
a positive integer value indicating the minimal length of the two barcodes which give rise to a hybrid barcode. |
a hybrid-free frequency table of barcode sequences
Generates a tree plot based on a hierarchical clustering of the complete distance matrix.
plotClusterGgTree(BC_dat, tree_est = "NJ", type = "rectangular", m = "hamming")
BC_dat |
a BCdat object. |
tree_est |
a character string, indicating the particular cluster algorithm, possible algorithms are "Neighbor-Joining" ("NJ") and "Unweighted Pair Group Method" ("UPGMA"). |
type |
a character string, the graph layout style ('rectangular', 'slanted', 'fan', 'circular', 'radial', 'equal_angle' or 'daylight'). |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
a ggtree object.
## Not run:
data(BC_dat)
plotClusterGgTree(BC_dat, tree_est = "UPGMA", type = "circular")
## End(Not run)
Generates a tree plot based on a hierarchical clustering of the complete distance matrix.
plotClusterTree( BC_dat, tree_est = "NJ", type = "unrooted", tipLabel = FALSE, m = "hamming" )
BC_dat |
a BCdat object. |
tree_est |
a character string, indicating the particular cluster algorithm, possible algorithms are "Neighbor-Joining" ("NJ") and "Unweighted Pair Group Method" ("UPGMA"). |
type |
a character string, the graph layout style ("unrooted", "phylogram", "cladogram", "fan", "radial"). |
tipLabel |
a logical value, indicating the use of labeled tree leaves. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
plotDistanceIgraph will create a graph-like visualisation (ripple plot) of the corresponding barcode sequences and their similarity based on the igraph package. The nodes represent the barcode sequences and their respective sizes reflect the corresponding read counts. Edges between nodes indicate a distance between two barcodes of at most minDist. If ori_BCs is provided, the node color also reflects the distance of a particular barcode to one of the initial barcodes.
plotDistanceIgraph( BC_dat, minDist = 1, loga = TRUE, ori_BCs = NULL, threeD = FALSE, complete = FALSE, col_type = "rainbow", leg_pos = "left", inset = -0.125, title = "Distance", m = "hamming" )
BC_dat |
a BCdat object. |
minDist |
an integer value representing the maximal distance value for which the graph will contain edges. |
loga |
a logical value, indicating the use or non-use of logarithmic read count values. |
ori_BCs |
a vector of character strings containing the barcode sequences (without the fixed positions of the barcode construct). |
threeD |
a logical value to choose between 2D and 3D visualisation. |
complete |
a logical value. If TRUE, every node will have at least one edge. |
col_type |
a character string, choosing one of the available color palettes. |
leg_pos |
a character string, containing the position of the legend (e.g. topleft), if NULL no legend will be plotted |
inset |
a numeric value, specifying the distance from the margins as a fraction of the plot region |
title |
a character string, containing the legend title |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
an igraph object.
plotDistanceVisNetwork will create a graph-like visualisation (ripple plot) of the corresponding barcode sequences and their similarity based on the visNetwork package. The nodes represent the barcode sequences and their respective sizes reflect the corresponding read counts. Edges between nodes indicate a distance between two barcodes of at most minDist. If ori_BCs is provided, the node color also reflects the distance of a particular barcode to one of the given barcodes.
plotDistanceVisNetwork( BC_dat, minDist = 1, loga = TRUE, ori_BCs = NULL, complete = FALSE, col_type = "rainbow", m = "hamming" )
BC_dat |
a BCdat object. |
minDist |
an integer value representing the maximal distance value for which the graph will contain edges. |
loga |
a logical value indicating the use or non-use of logarithmic read count values. |
ori_BCs |
a vector of character strings containing the barcode sequences (without the fixed positions of the barcode construct). |
complete |
a logical value. If TRUE, every node will have at least one edge. |
col_type |
a character string, choosing one of the available color palettes. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
a visNetwork object.
plotDistanceVisNetwork_EC will create a graph-like visualisation (ripple plot) of the corresponding barcode sequences and their similarity based on the visNetwork package. The nodes represent the barcode sequences and their respective sizes reflect the corresponding read counts. Edges between nodes indicate a distance between two barcodes of at most minDist. If ori_BCs is provided, the effects of the error correction function will be color-coded only for those sequences.
plotDistanceVisNetwork_EC( BC_dat, BC_dat_EC, minDist = 1, loga = TRUE, equal_node_sizes = TRUE, BC_threshold = NULL, ori_BCs = NULL, complete = FALSE, col_type = "rainbow", m = "hamming" )
BC_dat |
a BCdat object. |
BC_dat_EC |
the corresponding error corrected BCdat object (EC_analysis has to be TRUE) |
minDist |
an integer value representing the maximal distance value for which the graph will contain edges. |
loga |
a logical value indicating the use or non-use of logarithmic read count values. |
equal_node_sizes |
a logical value. If TRUE, every node will have the same size. |
BC_threshold |
an integer value representing the number of barcodes for which the color-coding should be applied (starting with the barcodes with the most read counts). |
ori_BCs |
a vector of character strings containing the barcode sequences (without the fixed positions of the barcode construct). |
complete |
a logical value. If TRUE, every node will have at least one edge. |
col_type |
a character string, choosing one of the available color palettes. |
m |
a character string, Method for distance calculation, default value is Hamming distance. Possible values are "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex" (see stringdist function of the stringdist-package for more information). |
a visNetwork object.
Creates a plot visualising the nucleotide frequency within the entire fastq file.
plotNucFrequency(source_dir, file_name)
source_dir |
a character string containing the path to the sequencing file. |
file_name |
a character string containing the name of the sequencing file. |
a ggplot2 object.
Creates a plot of the quality values accommodated by the fastq file.
plotQualityScoreDis(source_dir, file_name, type = "median", rel = FALSE)
source_dir |
a character string of the path to the source directory. |
file_name |
a character string of the file name. |
type |
a character string, possible values are "mean" and "median". |
rel |
a logical value. If TRUE, the y-axis will show relative frequencies instead of absolute counts. |
a ggplot2 object.
## Not run:
source_dir <- system.file("extdata", package = "genBaRcode")
plotQualityScoreDis(source_dir, file_name = "test_data.fastq", type = "mean")
## End(Not run)
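The difference between type = "mean" and type = "median", and between absolute and relative counts, can be sketched in base R. The quality strings are hypothetical, Phred+33 encoding is an assumption, and read_score is an illustrative helper that is not part of the package.

```r
# Hypothetical quality strings (fourth line of each fastq record), assuming Phred+33.
quals <- c("IIIIIIII", "IIII####", "5555IIII")

# Decode one quality string into integer Phred scores.
phred <- function(q) utf8ToInt(q) - 33L

# Summarise each read by its mean or median score, mirroring the type argument.
read_score <- function(quals, type = c("median", "mean")) {
  type <- match.arg(type)
  f <- if (type == "mean") mean else median
  vapply(quals, function(q) as.numeric(f(phred(q))), numeric(1), USE.NAMES = FALSE)
}

mean_scores <- read_score(quals, "mean")
median_scores <- read_score(quals, "median")

# Absolute counts per score, and relative frequencies as with rel = TRUE.
abs_counts <- table(mean_scores)
rel_counts <- abs_counts / sum(abs_counts)
print(abs_counts)
print(rel_counts)
```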
Visualises the mean, median, 25
plotQualityScorePerCycle(source_dir, file_name)
source_dir |
a character string containing the path to the sequencing file. |
file_name |
a character string containing the name of the sequencing file. |
a ggplot2 object.
Generates a barplot visualising the abundances of unique read count frequencies.
plotReadFrequencies( BC_dat, b = 30, bw = NULL, show_it = FALSE, log = FALSE, dens = FALSE )
BC_dat |
a BCdat object. |
b |
an integer value, defining the number of bins. Overridden by bw. Defaults to 30. (see '?ggplot2::geom_histogram') |
bw |
an integer value, defining the width of the bins. |
show_it |
a logical value. If TRUE, the respective values are printed on the console. |
log |
a logical value. If TRUE, the y-axis will be on a log scale. |
dens |
a logical value. If TRUE, the density of the read frequencies will be plotted. |
a ggplot2 object.
data(BC_dat)
plotReadFrequencies(BC_dat, b = 10, show_it = TRUE)
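The quantity behind this plot, how often each distinct read count occurs, can be sketched in base R. The read counts are hypothetical and the binning via cut() only approximates what a histogram with b bins would do; it is not the package's code.

```r
# Hypothetical read counts, one value per barcode.
read_count <- c(1, 1, 1, 2, 2, 5, 40, 40, 350)

# How often does each distinct read count occur?
abundance <- table(read_count)
print(abundance)

# Binned version, comparable to choosing a number of bins (b) for a histogram.
b <- 3
bins <- cut(read_count, breaks = b)
print(table(bins))
```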
Plots a sequence logo
plotSeqLogo(BC_dat, colrs = NULL)
BC_dat |
a character vector or BCdat object containing the respective sequences. |
colrs |
a character vector containing the desired colors for the nucleotides A, T, C, G and N (in that order) |
a ggplot2 object
Uses the result of the generateTimeSeriesData function as input and generates a visualisation of the clonal contributions over a number of given time points (similar to a stacked barplot).
plotTimeSeries( ov_dat, colr = NULL, tp = NULL, x_label = "time", y_label = "contribution" )
ov_dat |
a numeric matrix consisting of all time points as columns and all barcode sequences as rows and the corresponding read counts as numerical values (see function |
colr |
a vector of character strings identifying a certain color palette. |
tp |
a numeric vector containing the time points of measurement (in case of unequally distributed time points). |
x_label |
a character string providing the x-axis label. |
y_label |
a character string providing the y-axis label. |
a ggplot2 object.
ov_dat <- matrix(round(runif(100, min = 0, max = 1000)), ncol = 5)
rownames(ov_dat) <- paste("barcode", 1:20)
plotTimeSeries(ov_dat)
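How read counts translate into clonal contributions can be sketched in base R, assuming that a contribution is a barcode's share of all reads at a given time point. That normalisation is an assumption made for illustration; the package may scale the data differently.

```r
set.seed(1)

# Hypothetical read counts: 20 barcodes (rows) at 5 time points (columns).
ov_dat <- matrix(round(runif(100, min = 0, max = 1000)), ncol = 5,
                 dimnames = list(paste("barcode", 1:20), paste0("tp", 1:5)))

# Share of each barcode per time point (columns sum to 1).
contrib <- prop.table(ov_dat, margin = 2)

# Cumulative shares give the band boundaries of a stacked representation.
bands <- apply(contrib, 2, cumsum)
print(round(colSums(contrib), 3))
```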
plotVennDiagram creates a Venn diagram and is based on the VennDiagram package. It accepts a list of BCdat objects and returns a ggplot2 object.
plotVennDiagram( BC_dat, alpha_value = 0.4, colrs = NA, border_color = NA, plot_title = "", legend_sort = NULL, annotationSize = 5 )
BC_dat |
a list of BCdat objects. |
alpha_value |
color transparency value [0-1]. |
colrs |
a character vector containing the desired colors, if NA the colors will be chosen automatically. |
border_color |
a character value specifying the desired border color, if NA no border will be drawn. |
plot_title |
a character value. |
legend_sort |
a character or factor vector in case the order of legend items needs to be changed. |
annotationSize |
an integer value specifying the text size within the Venn diagram. |
a ggplot2 object.
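The regions of such a diagram correspond to set operations on the barcode sequences of each sample. The base R sketch below uses hypothetical barcode vectors in place of BCdat objects and only computes the overlaps; it does not draw anything.

```r
# Hypothetical barcode sets, one character vector per sample.
samples <- list(
  sample_A = c("ACGT", "TTGA", "GGCA"),
  sample_B = c("ACGT", "GGCA", "CCAT"),
  sample_C = c("ACGT", "CCAT", "TTTT")
)

# Barcodes found in every sample (centre of the diagram).
shared_all <- Reduce(intersect, samples)

# Barcodes shared by samples A and B.
shared_AB <- intersect(samples$sample_A, samples$sample_B)

# Barcodes exclusive to sample A.
only_A <- setdiff(samples$sample_A, union(samples$sample_B, samples$sample_C))

print(list(shared_all = shared_all, shared_AB = shared_AB, only_A = only_A))
```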
Generates a BCdat object after barcode backbone identification.
prepareDatObject(dat, results_dir, label, bc_backbone, min_reads, save_it)
dat |
a tbl_df object (e.g. created by dplyr::count) |
results_dir |
a character string which contains the path to the results directory. |
label |
a character string which serves as a label for every kind of created output file. |
bc_backbone |
a character string describing the barcode design, variable positions have to be marked with the letter 'N'. |
min_reads |
a positive integer value; all extracted barcode sequences with a read count smaller than min_reads will be excluded from the results. |
save_it |
a logical value. If TRUE, the raw data will be saved as a csv-file. |
a BCdat object.
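The effect of min_reads can be sketched in base R. A plain data.frame stands in for the tbl_df input, the values are hypothetical, and the final sorting step is an illustrative choice rather than documented behaviour.

```r
# Hypothetical count table, standing in for the output of dplyr::count.
dat <- data.frame(read_count = c(120, 45, 2, 1, 1),
                  barcode = c("ACGT", "TTGA", "GGCA", "CCAT", "TTTT"),
                  stringsAsFactors = FALSE)

# Exclude every barcode with a read count smaller than min_reads.
min_reads <- 2
kept <- dat[dat$read_count >= min_reads, ]

# Sort by abundance (an assumption made for readability only).
kept <- kept[order(kept$read_count, decreasing = TRUE), ]
print(kept)
```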
Reads the corresponding fast(a/q) file(s), extracts the defined barcode constructs and counts them. Optionally, a Phred-score-based quality filtering is conducted and the results are saved in a csv file.
processingRawData( file_name, source_dir, results_dir = NULL, mismatch = 0, indels = FALSE, label = "", bc_backbone, bc_backbone_label = NULL, min_score = 30, min_reads = 2, save_it = TRUE, seqLogo = FALSE, cpus = 1, strategy = "sequential", full_output = FALSE, wobble_extraction = TRUE, dist_measure = "hamming" )
file_name |
a character string or a character vector, containing the file name(s). |
source_dir |
a character string which contains the path to the source files. |
results_dir |
a character string which contains the path to the results directory. If no value is assigned the source_dir will automatically also become the results_dir. |
mismatch |
a non-negative integer value, default is 0; greater values indicate the number of allowed mismatches when identifying the barcode constructs. |
indels |
a logical value. If TRUE the chosen number of mismatches will be interpreted as edit distance and allow for insertions and deletions as well (currently under construction). |
label |
a character string which serves as a label for every kind of created output file. |
bc_backbone |
a character string describing the barcode design; variable positions have to be marked with the letter 'N'. To only cluster the sequenced reads, set bc_backbone to "none"; the mismatch parameter is then interpreted as the maximum dissimilarity for which two reads will be clustered together. |
bc_backbone_label |
a character vector, an optional list of barcode backbone names serving as additional identifier within file names and BCdat labels. If not provided ordinary numbers will serve as alternative. |
min_score |
a positive integer value; all fastq sequences with an average score smaller than min_score will be excluded. If min_score = 0, no quality score filtering is applied. |
min_reads |
a positive integer value; all extracted barcode sequences with a read count smaller than min_reads will be excluded from the results. |
save_it |
a logical value. If TRUE, the raw data will be saved as a csv-file. |
seqLogo |
a logical value. If TRUE, the sequence logo of the entire NGS file will be generated and saved. |
cpus |
an integer value, indicating the number of available cpus. |
strategy |
a character string naming the parallelisation strategy of the future package; the default is "sequential" for cpus = 1 and "multisession" for cpus > 1. For further information see the documentation of future::plan(). |
full_output |
a logical value. If TRUE, additional output files will be generated. |
wobble_extraction |
a logical value. If TRUE, single reads will be stripped of the backbone and only the "wobble" positions will be left. |
dist_measure |
a character value. If bc_backbone = "none", single reads will be clustered based on a distance measure. Available distance methods are optimal string alignment ("osa"), Levenshtein ("lv"), Damerau-Levenshtein ("dl"), Hamming ("hamming"), longest common substring ("lcs"), q-gram ("qgram"), cosine ("cosine"), Jaccard ("jaccard"), Jaro-Winkler ("jw") and a distance based on soundex encoding ("soundex"). For more detailed information see the stringdist function of the stringdist package. |
a BCdat object which will include read counts, barcode sequences, the results directory and the search barcode backbone.
## Not run:
bc_backbone <- "ACTNNCGANNCTTNNCGANNCTTNNGGANNCTANNACTNNCGANNCTTNNCGANNCTTNNGGANNCTANNACTNNCGANN"
source_dir <- system.file("extdata", package = "genBaRcode")
BC_dat <- processingRawData(file_name = "test_data.fastq.gz", source_dir,
                            results_dir = "/my/test/directory/", mismatch = 2,
                            label = "test", bc_backbone, min_score = 30,
                            indels = FALSE, min_reads = 2, save_it = FALSE,
                            seqLogo = FALSE)
## End(Not run)
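How a backbone with 'N' positions can be matched against reads, and how the wobble positions can be isolated, is sketched below in base R. This is an illustration under simplifying assumptions: the backbone and reads are hypothetical and shortened, each 'N' is taken to match one of A, C, G or T, and mismatches are not tolerated (mismatch = 0). The package's own search may work differently.

```r
# Hypothetical, shortened backbone: fixed positions plus variable 'N' positions.
bc_backbone <- "ACTNNCGANNCTT"

# Wrap each run of N in a capture group, then let every N match one nucleotide.
pattern <- gsub("(N+)", "(\\1)", bc_backbone)
pattern <- gsub("N", "[ACGT]", pattern)

# Hypothetical reads: the construct may be embedded in flanking sequence.
reads <- c("GGACTTGCGAACCTTAA", "ACTAACGATTCTT", "TTTTTTTT")

# Extract the captured wobble positions; reads without the construct yield NA.
m <- regmatches(reads, regexec(pattern, reads))
wobble <- vapply(m, function(x) {
  if (length(x) > 0) paste(x[-1], collapse = "") else NA_character_
}, character(1))

# Count identical barcodes, ignoring reads without a match.
print(table(wobble))
```

Keeping only the captured groups corresponds to the idea behind wobble_extraction = TRUE, where the fixed backbone positions are stripped from each read.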
Excludes all sequences of a given fastq file below a certain quality value.
qualityFiltering(file_name, source_dir, min_score = 30)
file_name |
a character string containing the name of the source file. |
source_dir |
a character string containing the path to the source directory. |
min_score |
an integer value representing the minimal average phred score a read has to achieve in order to be accepted. |
a ShortRead object.
## Not run:
source_dir <- system.file("extdata", package = "genBaRcode")
qualityFiltering(file_name = "test_data.fastq.gz", source_dir, min_score = 30)
## End(Not run)
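The filtering criterion, a minimal average Phred score per read, can be sketched in base R. The records are hypothetical, Phred+33 encoding is an assumption, and a plain data.frame replaces the ShortRead object that the function actually returns.

```r
# Hypothetical fastq records, assuming Phred+33 encoded quality strings.
fastq <- data.frame(
  id   = c("read1", "read2", "read3"),
  seq  = c("ACGTACGT", "ACGTACGT", "ACGTACGT"),
  qual = c("IIIIIIII", "IIII####", "????????"),
  stringsAsFactors = FALSE
)

# Average Phred score of every read.
avg <- vapply(fastq$qual, function(q) mean(utf8ToInt(q) - 33L),
              numeric(1), USE.NAMES = FALSE)

# Accept reads whose average score reaches min_score.
min_score <- 30
passed <- fastq[avg >= min_score, ]
print(passed$id)
```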
Reads a data table (csv-file) and returns a BCdat object.
readBCdat(path, label = "", BC_backbone = "", file_name, s = ";")
path |
a character string containing the path to a saved read count table (two columns containing read counts and barcode sequences). |
label |
a character string containing a label of the data set. |
BC_backbone |
a character string containing the barcode structure information. |
file_name |
a character string containing the name of the file to read in. |
s |
a character value identifying the column separator. |
a BCdat object.
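The kind of table readBCdat expects can be sketched in base R: two columns holding read counts and barcode sequences, separated by ";" as in the default s = ";". The header row and the column names "read_count" and "barcode" are assumptions made for this illustration, and the sketch stops at a plain data.frame instead of building a BCdat object.

```r
# Write a hypothetical read count table to a temporary csv file.
tmp <- tempfile(fileext = ".csv")
tab <- data.frame(read_count = c(120, 45, 2),
                  barcode = c("ACGT", "TTGA", "GGCA"),
                  stringsAsFactors = FALSE)
write.table(tab, tmp, sep = ";", row.names = FALSE, quote = FALSE)

# Read it back using the same column separator.
dat <- read.table(tmp, sep = ";", header = TRUE, stringsAsFactors = FALSE)
str(dat)

# Clean up the temporary file.
unlink(tmp)
```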
Replaces the barcode backbone slot of a BCdat object.
setBackbone(object, value)
object |
a BCdat object. |
value |
a character string consisting of exclusively IUPAC-nucleotide-code conform letters. |
a BCdat object.
data(BC_dat)
new_backbone <- getBackboneSelection("BC32-T-Sapphire")
BC_dat_new <- setBackbone(BC_dat, new_backbone)
Replaces the label slot of a BCdat object.
setLabel(object, value)
object |
a BCdat object. |
value |
a character string. |
a BCdat object.
data(BC_dat)
new_label <- "foo-bar"
BC_dat_new <- setLabel(BC_dat, new_label)
Replaces the read count slot of a BCdat object.
setReads(object, value)
object |
a BCdat object. |
value |
a data.frame containing two columns called "read_count" and "barcode". |
a BCdat object.
data(BC_dat)
require("dplyr")
bcs <- unlist(lapply(1:20, function(x) {
  c("A", "C", "T", "G") %>%
    sample(replace = TRUE, size = 32) %>%
    paste0(collapse = "")
}))
new_read_count_table <- data.frame(read_count = sample(1:1000, size = 20), barcode = bcs)
BC_dat_new <- setReads(BC_dat, new_read_count_table)
Replaces the results directory slot of a BCdat object.
setResultsDir(object, value)
object |
a BCdat object. |
value |
a character string of an existing path. |
a BCdat object.
data(BC_dat)
new_path <- getwd()
BC_dat_new <- setResultsDir(BC_dat, new_path)