ISAnalytics 1.6.2
ISAnalytics is an R package developed to analyze gene therapy vector insertion sites data identified from genomics next generation sequencing reads for clonal tracking studies.
In this vignette we will explain how to properly set up the workflow and the first steps of data import and data cleaning.
ISAnalytics can be installed quickly in different ways: via BiocManager or via devtools.
There are always 2 versions of the package active:

- RELEASE is the latest stable version
- DEVEL is the development version; it is the most up-to-date version, where all new features are introduced

To install from Bioconductor, RELEASE version:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ISAnalytics")

DEVEL version:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("ISAnalytics")

To install from GitHub with devtools, RELEASE:
if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "RELEASE_3_15",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

DEVEL:
if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "master",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

ISAnalytics has a verbose option that allows some functions to print
additional information to the console while they’re executing.
To disable this feature do:
# DISABLE
options("ISAnalytics.verbose" = FALSE)
# ENABLE
options("ISAnalytics.verbose" = TRUE)
Some functions also produce reports in a user-friendly HTML format. To set this feature:
# DISABLE HTML REPORTS
options("ISAnalytics.reports" = FALSE)
# ENABLE HTML REPORTS
options("ISAnalytics.reports" = TRUE)

This section demonstrates how to properly set up your workflow
with ISAnalytics using the “dynamic vars” system.
From ISAnalytics 1.5.4 onwards, a new system here referred to as
“dynamic vars” has been implemented to improve the flexibility of the package,
by allowing multiple input formats based on user needs rather than enforcing
hard-coded names and structures. In this way, users that do not follow the
standard name conventions used by the package need to put only minimal effort
into making their inputs compliant with the package requirements.
There are 5 main categories of inputs you can customize:

- mandatory IS vars
- annotation IS vars
- association file columns
- VISPA2 stats specs
- matrix file suffixes
The general approach is based on the specification of predefined tags and their associated information in the form of simple data frames with a standard structure, namely:
| names | types | transform | flag | tag | 
|---|---|---|---|---|
| <name of the column> | <type> | <a lambda or NULL> | <flag> | <tag> | 
where:

- names contains the name of the column as a character
- types contains the type of the column. The type should be expressed as a string and should be one of the allowed types:
  - char for character (strings)
  - int for integers
  - logi for logical values (TRUE / FALSE)
  - numeric for numeric values
  - factor for factors
  - date for generic date formats - note that functions that need to read and parse files will try to guess the format, and parsing may fail; use ISAnalytics::date_formats() to view the accepted formats
- transform: a purrr-style lambda that is applied immediately after importing. This is useful for simple transformations like removing unwanted characters or rounding to a certain precision. Please note that these lambdas need to be functions that accept a vector as input and only operate a transformation, i.e. they output a vector of the same length as the input. For more complicated applications that may require the values of other columns, appropriate functions should be applied manually post-import.
- flag: as of now, it should be set to either required or optional - some functions internally check only for the presence of required tags, and if those are missing from the inputs they fail, signaling the failure to the user
- tag: a specific tag expressed as a string - see Section 3.2

Dynamic variables general approach
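As an illustration of the contract a transform must satisfy (a vector in, a vector of the same length out), here is a small base-R sketch. The helper names below are ours, invented for illustration; the package itself also accepts purrr-style formulas such as the one shown later in this vignette.

```r
# A transform takes a vector and returns a vector of the same length.
# strip_chr_prefix emulates the purrr-style lambda
# ~ stringr::str_replace_all(.x, "chr", "") using only base R.
strip_chr_prefix <- function(.x) sub("^chr", "", .x)

strip_chr_prefix(c("chr1", "chr2", "chrX"))
#> [1] "1" "2" "X"

# Rounding to a fixed precision, another typical one-column transformation:
round_to_2 <- function(.x) round(.x, digits = 2)
```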
For each category of dynamic vars there are 3 functions: a getter, a setter and a resetter.
Setters take the new variables in input, validate them and eventually change the lookup table. If validation fails, an error is thrown instead, inviting the user to review the inputs. Moreover, if some of the critical tags for the category are missing, a warning appears with a list of the missing ones.
Let’s take a look at some examples.
On package loading, all lookup tables are set to default values. For example, for mandatory IS vars we have:
mandatory_IS_vars(TRUE)
#> # A tibble: 3 × 5
#>   names             types transform flag     tag       
#>   <chr>             <chr> <list>    <chr>    <chr>     
#> 1 chr               char  <NULL>    required chromosome
#> 2 integration_locus int   <NULL>    required locus     
#> 3 strand            char  <NULL>    required is_strand

Let's suppose our matrices follow a different standard, and integration events are characterized by 5 fields, like so (the example contains random data):
| chrom | position | strand | gap | junction | 
|---|---|---|---|---|
| “chr1” | 342543 | “+” | 100 | 50 | 
| … | … | … | … | … | 
To make this work with ISAnalytics functions, we need to compile the lookup table like this:
new_mand_vars <- tibble::tribble(
  ~names, ~types, ~transform, ~flag, ~tag,
  "chrom", "char", ~stringr::str_replace_all(.x, "chr", ""), "required",
  "chromosome",
  "position", "int", NULL, "required", "locus",
  "strand", "char", NULL, "required", "is_strand",
  "gap", "int", NULL, "required", NA_character_,
  "junction", "int", NULL, "required", NA_character_
)

Notice that we have specified a transformation for the "chromosome" tag: in this case we would like to have only the number of the chromosome without the prefix "chr" - this lambda will be executed immediately after import.
To set the new variables simply do:
set_mandatory_IS_vars(new_mand_vars)
#> Mandatory IS vars successfully changed
mandatory_IS_vars(TRUE)
#> # A tibble: 5 × 5
#>   names    types transform flag     tag       
#>   <chr>    <chr> <list>    <chr>    <chr>     
#> 1 chrom    char  <formula> required chromosome
#> 2 position int   <NULL>    required locus     
#> 3 strand   char  <NULL>    required is_strand 
#> 4 gap      int   <NULL>    required <NA>      
#> 5 junction int   <NULL>    required <NA>

If you don't specify a critical tag, a warning message is displayed:
new_mand_vars[1, ]$tag <- NA_character_
set_mandatory_IS_vars(new_mand_vars)
#> Warning: important tags missing
#> ℹ Some tags are required for proper execution of some functions. If these tags are not provided, execution of dependent functions might fail. Review your inputs carefully.
#> ℹ Missing tags: chromosome
#> ℹ To see where these are involved type `inspect_tags(c('chromosome'))`
#> Mandatory IS vars successfully changed
mandatory_IS_vars(TRUE)
#> # A tibble: 5 × 5
#>   names    types transform flag     tag      
#>   <chr>    <chr> <list>    <chr>    <chr>    
#> 1 chrom    char  <formula> required <NA>     
#> 2 position int   <NULL>    required locus    
#> 3 strand   char  <NULL>    required is_strand
#> 4 gap      int   <NULL>    required <NA>     
#> 5 junction int   <NULL>    required <NA>

If you change your mind and want to go back to the defaults:
reset_mandatory_IS_vars()
#> Mandatory IS vars reset to default
mandatory_IS_vars(TRUE)
#> # A tibble: 3 × 5
#>   names             types transform flag     tag       
#>   <chr>             <chr> <list>    <chr>    <chr>     
#> 1 chr               char  <NULL>    required chromosome
#> 2 integration_locus int   <NULL>    required locus     
#> 3 strand            char  <NULL>    required is_strand

The principle is the same for annotation IS vars, association file columns and VISPA2 stats specs. Here is a summary of the functions for each:

- mandatory IS vars: mandatory_IS_vars(), set_mandatory_IS_vars(), reset_mandatory_IS_vars()
- annotation IS vars: annotation_IS_vars(), set_annotation_IS_vars(), reset_annotation_IS_vars()
- association file columns: association_file_columns(), set_af_columns_def(), reset_af_columns_def()
- VISPA2 stats specs: iss_stats_specs(), set_iss_stats_specs(), reset_iss_stats_specs()

Matrix file suffixes work slightly differently:
matrix_file_suffixes()
#> # A tibble: 10 × 3
#>    quantification   matrix_type   file_suffix                                 
#>    <chr>            <chr>         <glue>                                      
#>  1 seqCount         annotated     seqCount_matrix.no0.annotated.tsv.gz        
#>  2 fragmentEstimate annotated     fragmentEstimate_matrix.no0.annotated.tsv.gz
#>  3 barcodeCount     annotated     barcodeCount_matrix.no0.annotated.tsv.gz    
#>  4 cellCount        annotated     cellCount_matrix.no0.annotated.tsv.gz       
#>  5 ShsCount         annotated     ShsCount_matrix.no0.annotated.tsv.gz        
#>  6 seqCount         not_annotated seqCount_matrix.tsv.gz                      
#>  7 fragmentEstimate not_annotated fragmentEstimate_matrix.tsv.gz              
#>  8 barcodeCount     not_annotated barcodeCount_matrix.tsv.gz                  
#>  9 cellCount        not_annotated cellCount_matrix.tsv.gz                     
#> 10 ShsCount         not_annotated ShsCount_matrix.tsv.gz

To change this lookup table use the function set_matrix_file_suffixes():
the function will ask to specify a suffix for each quantification and for
both annotated and not annotated versions. These suffixes are used in the
automated matrix import function when scanning the file system.
To reset all lookup tables to their default configurations you can also
use the function reset_dyn_vars_config(), which reverts all changes.
Do I have to set these configurations again for every new session? At the moment yes, but this limitation is just temporary, so we suggest saving a script with the necessary configurations and executing it before proceeding with the analyses. In future updates we will add the possibility to export and import configurations. Stay tuned!
ISAnalytics import functions family

In this section we're going to explain in more detail how functions of the import family should be used, the most common workflows to follow and more.
The vast majority of the functions included in this package are designed to work in combination with the VISPA2 pipeline (Giulio Spinozzi, Andrea Calabria, 2017). If you don't know what it is, we strongly recommend taking a look at these links:
VISPA2 produces a standard file system structure starting from a folder you specify as your workbench or root. The structure always follows this schema:
Most of the functions implemented expect a standard file system structure as the one described above.
We call an "integration matrix" a tabular structure characterized by:

- a set of mandatory IS vars - see mandatory_IS_vars(). By default they're set to chr, integration_locus and strand
- a set of annotation IS vars - see annotation_IS_vars(). By default they're set to GeneName and GeneStrand

#> # A tibble: 3 × 8
#>   chr   integration_locus strand GeneName     GeneStrand  exp1  exp2  exp3
#>   <chr>             <dbl> <chr>  <chr>        <chr>      <dbl> <dbl> <dbl>
#> 1 1                 12324 +      NFATC3       +           4553  5345    NA
#> 2 6                657532 +      LOC100507487 +             76   545     5
#> 3 7                657532 +      EDIL3        -             NA    56    NA

The package uses a more compact form of these matrices, limiting the amount of NA values and optimizing time and memory consumption. For more info on this take a look at: Tidy data
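As a rough sketch of the idea (this is not the package's internal code), melting the wide matrix above into a tidy long format, with one row per observed (IS, sample) pair and NA values dropped, could look like this in base R. Column names mirror the outputs shown later in this vignette.

```r
# Toy wide integration matrix (random data)
wide <- data.frame(
  chr = c("1", "6"),
  integration_locus = c(12324, 657532),
  strand = c("+", "+"),
  exp1 = c(4553, 76),
  exp2 = c(5345, 545),
  exp3 = c(NA, 5)
)

sample_cols <- c("exp1", "exp2", "exp3")
# Stack one block per sample column, repeating the IS key columns
tidy <- do.call(rbind, lapply(sample_cols, function(s) {
  data.frame(
    wide[c("chr", "integration_locus", "strand")],
    CompleteAmplificationID = s,
    Value = wide[[s]]
  )
}))
# Dropping NA observations is what makes the tidy form compact
tidy <- tidy[!is.na(tidy$Value), ]
nrow(tidy)
#> [1] 5
```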
While integration matrices contain the actual data, we also need associated
sample metadata to perform the vast majority of the analyses.
ISAnalytics expects the metadata to be contained in a so called
“association file”, which is a simple tabular file.
To generate a blank association file you can use the function
generate_blank_association_file(). You can also view the standard
column names with association_file_columns().
To import metadata we use import_association_file(). This function is not
only responsible for reading the file into the R environment as a data frame,
but it is also capable of performing a file system alignment operation:
for each project and pool contained in the file, it scans
the file system starting from the provided root to check whether the
corresponding folders (contained in the appropriate column) can be found.
Remember that, to work properly, this operation expects a standard folder
structure, such as the one produced by VISPA2. This function also produces
an interactive HTML report.
fs_path <- generate_default_folder_structure()
withr::with_options(list(ISAnalytics.reports = FALSE), code = {
  af <- import_association_file(fs_path$af, root = fs_path$root)
})
#> *** Association file import summary ***
#> ℹ For detailed report please set option 'ISAnalytics.reports' to TRUE
#> Parsing problems detected: FALSE
#> Date parsing problems: FALSE
#> Column problems detected: FALSE
#> NAs found in important columns: FALSE
#> File system alignment: no problems detected

#>    ProjectID  FUSIONID PoolID TagSequence SubjectID VectorType VectorID ExperimentID Tissue TimePoint DNAFragmentation
#> 1:      PJ01 ET#382.46 POOL01   LTR75LC38     PT001      lenti    GLOBE         <NA>     PB      0060            SONIC
#> 2:      PJ01 ET#381.40 POOL01   LTR53LC32     PT001      lenti    GLOBE         <NA>     BM      0180            SONIC
#> 3:      PJ01  ET#381.9 POOL01   LTR83LC66     PT001      lenti    GLOBE         <NA>     BM      0180            SONIC
#> 4:      PJ01 ET#381.71 POOL01   LTR27LC94     PT001      lenti    GLOBE         <NA>     BM      0180            SONIC
#> 5:      PJ01  ET#381.2 POOL01   LTR69LC52     PT001      lenti    GLOBE         <NA>     PB      0180            SONIC
#> 6:      PJ01 ET#382.28 POOL01    LTR37LC2     PT001      lenti    GLOBE         <NA>     BM      0060            SONIC
#>    PCRMethod TagIDextended Keywords CellMarker      TagID NGSProvider NGSTechnology ConverrtedFilesDir
#> 1:      SLiM     LTR75LC38     <NA>        MNC LTR75.LC38        <NA>         HiSeq               <NA>
#> 2:      SLiM     LTR53LC32     <NA>        MNC LTR53.LC32        <NA>         HiSeq               <NA>
#> 3:      SLiM     LTR83LC66     <NA>        MNC LTR83.LC66        <NA>         HiSeq               <NA>
#> 4:      SLiM     LTR27LC94     <NA>        MNC LTR27.LC94        <NA>         HiSeq               <NA>
#> 5:      SLiM     LTR69LC52     <NA>        MNC LTR69.LC52        <NA>         HiSeq               <NA>
#> 6:      SLiM      LTR37LC2     <NA>        MNC  LTR37.LC2        <NA>         HiSeq               <NA>
#>    ConverrtedFilesName SourceFileFolder SourceFileNameR1 SourceFileNameR2 DNAnumber ReplicateNumber DNAextractionDate
#> 1:                <NA>             <NA>             <NA>             <NA> PT001-103               3        2016-03-16
#> 2:                <NA>             <NA>             <NA>             <NA>  PT001-81               2        2016-07-15
#> 3:                <NA>             <NA>             <NA>             <NA>  PT001-81               1        2016-07-15
#> 4:                <NA>             <NA>             <NA>             <NA>  PT001-81               3        2016-07-15
#> 5:                <NA>             <NA>             <NA>             <NA>  PT001-74               1        2016-07-15
#> 6:                <NA>             <NA>             <NA>             <NA> PT001-107               2        2016-03-16
#>    DNAngUsed LinearPCRID LinearPCRDate SonicationDate LigationDate 1stExpoPCRID 1stExpoPCRDate 2ndExpoID 2ndExpoDate
#> 1:    23.184        <NA>          <NA>     2016-11-02   2016-11-02    ET#380.46     2016-11-02      <NA>        <NA>
#> 2:   181.440        <NA>          <NA>     2016-11-02   2016-11-02    ET#379.40     2016-11-02      <NA>        <NA>
#> 3:   181.440        <NA>          <NA>     2016-11-02   2016-11-02     ET#379.9     2016-11-02      <NA>        <NA>
#> 4:   181.440        <NA>          <NA>     2016-11-02   2016-11-02    ET#379.71     2016-11-02      <NA>        <NA>
#> 5:    23.058        <NA>          <NA>     2016-11-02   2016-11-02     ET#379.2     2016-11-02      <NA>        <NA>
#> 6:   171.360        <NA>          <NA>     2016-11-02   2016-11-02    ET#380.28     2016-11-02      <NA>        <NA>
#>    FusionPrimerPCRID FusionPrimerPCRDate   PoolDate SequencingDate  VCN Genome SequencingRound Genotype TestGroup  MOI
#> 1:         ET#382.46          2016-11-03 2016-11-07     2016-11-15 0.30   hg19               1     <NA>      <NA> <NA>
#> 2:         ET#381.40          2016-11-03 2016-11-07     2016-11-15 0.27   hg19               1     <NA>      <NA> <NA>
#> 3:          ET#381.9          2016-11-03 2016-11-07     2016-11-15 0.27   hg19               1     <NA>      <NA> <NA>
#> 4:         ET#381.71          2016-11-03 2016-11-07     2016-11-15 0.27   hg19               1     <NA>      <NA> <NA>
#> 5:          ET#381.2          2016-11-03 2016-11-07     2016-11-15 0.24   hg19               1     <NA>      <NA> <NA>
#> 6:         ET#382.28          2016-11-03 2016-11-07     2016-11-15 0.42   hg19               1     <NA>      <NA> <NA>
#>    Engraftment Transduction Notes AddedField1 AddedField2 AddedField3 AddedField4 concatenatePoolIDSeqRun
#> 1:          NA           NA  <NA>        <NA>        <NA>        <NA>        <NA>                POOL01-1
#> 2:          NA           NA  <NA>        <NA>        <NA>        <NA>        <NA>                POOL01-1
#> 3:          NA           NA  <NA>        <NA>        <NA>        <NA>        <NA>                POOL01-1
#> 4:          NA           NA  <NA>        <NA>        <NA>        <NA>        <NA>                POOL01-1
#> 5:          NA           NA  <NA>        <NA>        <NA>        <NA>        <NA>                POOL01-1
#> 6:          NA           NA  <NA>        <NA>        <NA>        <NA>        <NA>                POOL01-1
#>    AddedField6_RelativeBloodPercentage AddedField7_PurityTestFeasibility AddedField8_FacsSeparationPurity Kapa
#> 1:                                <NA>                              <NA>                             <NA>   NA
#> 2:                                <NA>                              <NA>                             <NA>   NA
#> 3:                                <NA>                              <NA>                             <NA>   NA
#> 4:                                <NA>                              <NA>                             <NA>   NA
#> 5:                                <NA>                              <NA>                             <NA>   NA
#> 6:                                <NA>                              <NA>                             <NA>   NA
#>    ulForPool                                              CompleteAmplificationID               UniqueID StudyTestID
#> 1:        NA PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC ID00000000000000007433        <NA>
#> 2:        NA  PJ01_POOL01_LTR53LC32_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC ID00000000000000007340        <NA>
#> 3:        NA  PJ01_POOL01_LTR83LC66_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC ID00000000000000007310        <NA>
#> 4:        NA  PJ01_POOL01_LTR27LC94_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC ID00000000000000007370        <NA>
#> 5:        NA  PJ01_POOL01_LTR69LC52_PT001_PT001-74_lenti_GLOBE_PB_1_SLiM_0180_MNC ID00000000000000007303        <NA>
#> 6:        NA  PJ01_POOL01_LTR37LC2_PT001_PT001-107_lenti_GLOBE_BM_1_SLiM_0060_MNC ID00000000000000007417        <NA>
#>    StudyTestGroup MouseID Tigroup Tisource PathToFolderProjectID SamplesNameCheck TimepointDays TimepointMonths
#> 1:           <NA>    <NA>    <NA>     <NA>                 /PJ01             <NA>          0060              02
#> 2:           <NA>    <NA>    <NA>     <NA>                 /PJ01             <NA>          0180              06
#> 3:           <NA>    <NA>    <NA>     <NA>                 /PJ01             <NA>          0180              06
#> 4:           <NA>    <NA>    <NA>     <NA>                 /PJ01             <NA>          0180              06
#> 5:           <NA>    <NA>    <NA>     <NA>                 /PJ01             <NA>          0180              06
#> 6:           <NA>    <NA>    <NA>     <NA>                 /PJ01             <NA>          0060              02
#>    TimepointYears ng DNA corrected                    Path                                      Path_quant
#> 1:             01            23.18 /tmp/RtmpsFsM2B/fs/PJ01 /tmp/RtmpsFsM2B/fs/PJ01/quantification/POOL01-1
#> 2:             01           181.44 /tmp/RtmpsFsM2B/fs/PJ01 /tmp/RtmpsFsM2B/fs/PJ01/quantification/POOL01-1
#> 3:             01           181.44 /tmp/RtmpsFsM2B/fs/PJ01 /tmp/RtmpsFsM2B/fs/PJ01/quantification/POOL01-1
#> 4:             01           181.44 /tmp/RtmpsFsM2B/fs/PJ01 /tmp/RtmpsFsM2B/fs/PJ01/quantification/POOL01-1
#> 5:             01            23.06 /tmp/RtmpsFsM2B/fs/PJ01 /tmp/RtmpsFsM2B/fs/PJ01/quantification/POOL01-1
#> 6:             01           171.36 /tmp/RtmpsFsM2B/fs/PJ01 /tmp/RtmpsFsM2B/fs/PJ01/quantification/POOL01-1
#>                                Path_iss
#> 1: /tmp/RtmpsFsM2B/fs/PJ01/iss/POOL01-1
#> 2: /tmp/RtmpsFsM2B/fs/PJ01/iss/POOL01-1
#> 3: /tmp/RtmpsFsM2B/fs/PJ01/iss/POOL01-1
#> 4: /tmp/RtmpsFsM2B/fs/PJ01/iss/POOL01-1
#> 5: /tmp/RtmpsFsM2B/fs/PJ01/iss/POOL01-1
#> 6: /tmp/RtmpsFsM2B/fs/PJ01/iss/POOL01-1

You can change several arguments in the function call to modify the behavior of the function.
- root: the root of the file system, where the scan for alignment starts. Set it to NULL if you only want to import the association file without file system alignment. Beware that some of the automated import functionalities won't work!
- proj_folder: the name of the column containing the project folder (by default PathToFolderProjectID). Paths in this column should be relative file paths, so if for example your root is set to "/home" and your project folder in the association file is set to "/PJ01", the function will check that the directory exists under "/home/PJ01". If your file contains absolute paths, keep the PathToFolderProjectID column as is and set root = ""
- dates_format: a string that is useful for properly parsing dates from tabular formats
- separator: the column separator used in the file. Defaults to "\t"; other valid separators are "," (comma) and ";" (semi-colon)
- filter_for: you can set this argument to a named list of filters, where names are column names. For example, list(ProjectID = "PJ01") will return only those rows whose attribute "ProjectID" equals "PJ01"
- import_iss: either TRUE or FALSE. If set to TRUE, performs an internal call to import_Vispa2_stats() (see next section) and appends the imported files to the metadata
- convert_tp: either TRUE or FALSE. Converts the column containing the time point expressed in days into months and years (with custom logic)
- report_path: the path where the HTML report should be saved. Set it to NULL to avoid the production of a report
- ...: additional named arguments to pass to import_Vispa2_stats() if you chose to import VISPA2 stats

For further details view the dedicated function documentation.
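To make the filter_for semantics concrete, here is a base-R sketch of how such a named-list filter behaves. This is our own toy helper for illustration, not the package's internal implementation; we assume that multiple conditions combine with AND and that a vector value means "one of".

```r
# Keep only rows where every named column matches one of the given values
# (hypothetical re-implementation, for illustration only)
apply_filters <- function(df, filter_for) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filter_for)) {
    keep <- keep & df[[col]] %in% filter_for[[col]]
  }
  df[keep, , drop = FALSE]
}

af_toy <- data.frame(
  ProjectID = c("PJ01", "PJ01", "PJ02"),
  Tissue = c("PB", "BM", "PB")
)
nrow(apply_filters(af_toy, list(ProjectID = "PJ01")))
#> [1] 2
```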
NOTE: the function supports files in various formats as long as the correct
separator is provided. It also accepts files in *.xlsx and *.xls formats,
but we do not recommend using these, since the report won't include a
detailed summary of potential parsing problems.
The interactive report includes useful information such as parsing problems, column problems, NAs found in important columns, file system alignment results and imported VISPA2 stats (if import_iss was TRUE).

VISPA2 automatically produces summary files for each pool holding
information that can be useful for other analyses downstream,
so it is recommended to import them in the first steps of the workflow.
To do that, you can use import_Vispa2_stats():
withr::with_options(list(ISAnalytics.reports = FALSE), {
    vispa_stats <- import_Vispa2_stats(
        association_file = af,
        join_with_af = FALSE
    )
})

#> # A tibble: 6 × 14
#>   POOL     TAG       RUN_NAME      PHIX_MAPPING PLASMID_MAPPED_BY… BARCODE_MUX LTR_IDENTIFIED TRIMMING_FINAL_… LV_MAPPED
#>   <chr>    <chr>     <chr>                <int>              <int>       <int>          <int>            <int>     <int>
#> 1 POOL01-1 LTR75LC38 PJ01|POOL01-1     43586699            2256176      645026         645026           630965    211757
#> 2 POOL01-1 LTR53LC32 PJ01|POOL01-1     43586699            2256176      652208         652177           649044    303300
#> 3 POOL01-1 LTR83LC66 PJ01|POOL01-1     43586699            2256176      451519         451512           449669    204810
#> 4 POOL01-1 LTR27LC94 PJ01|POOL01-1     43586699            2256176      426500         426499           425666    185752
#> 5 POOL01-1 LTR69LC52 PJ01|POOL01-1     43586699            2256176       18300          18300            18290      6962
#> 6 POOL01-1 LTR37LC2  PJ01|POOL01-1     43586699            2256176      729327         729327           727219    318653
#> # … with 5 more variables: BWA_MAPPED_OVERALL <int>, ISS_MAPPED_OVERALL <int>, RAW_READS <lgl>, QUALITY_PASSED <lgl>,
#> #   ISS_MAPPED_PP <lgl>

The function requires as input the imported and file system aligned
association file, and it will scan the iss folder for files that match some
known prefixes (defaults are already provided, but you can change them as you
see fit). You can either choose to join the imported data frames with the
association file in input and obtain a single data frame, or keep them
separate: just set the parameter join_with_af accordingly.
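Conceptually, joining the stats with the metadata amounts to a merge on the pool/tag identifiers. Here is a minimal base-R sketch with toy tables; column names mirror the outputs shown above, the values are invented, and the real join keys and column handling are managed internally by import_Vispa2_stats().

```r
# Toy metadata and VISPA2 stats tables (invented values)
af_toy <- data.frame(
  PoolID = c("POOL01", "POOL01"),
  TagSequence = c("LTR75LC38", "LTR53LC32"),
  SubjectID = c("PT001", "PT001")
)
stats_toy <- data.frame(
  POOL = c("POOL01-1", "POOL01-1"),
  TAG = c("LTR75LC38", "LTR53LC32"),
  BARCODE_MUX = c(645026, 652208)
)
# Join each sample's stats onto its metadata row by the tag identifier
joined <- merge(af_toy, stats_toy, by.x = "TagSequence", by.y = "TAG")
```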
At the end of the process an HTML report is produced, signaling potential
problems.
You can directly call this function when you import the association file
by setting the import_iss argument of import_association_file to TRUE.
If you want to import a single integration matrix you can do so by using the
import_single_Vispa2Matrix() function.
This function reads the file and converts it into a tidy structure: several
different formats can be read, since you can specify the column separator.
matrix_path <- fs::path(fs_path$root,
                        "PJ01",
                        "quantification",
                        "POOL01-1",
                        "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz")
matrix <- import_single_Vispa2Matrix(matrix_path)

#>      chr integration_locus strand     GeneName GeneStrand
#>   1:  16          68164148      +       NFATC3          +
#>   2:   4         129390130      + LOC100507487          +
#>   3:   5          84009671      -        EDIL3          -
#>   4:  12          54635693      -         CBX5          -
#>   5:   2         181930711      +       UBE2E3          +
#>  ---                                                     
#> 798:  11          72729902      -       FCHSD2          -
#> 799:  12         133238196      +         POLE          -
#> 800:  20          10028583      -       ANKEF1          +
#> 801:  12          50594351      +        LIMA1          -
#> 802:   5          84009671      -        EDIL3          -
#>                                                   CompleteAmplificationID Value
#>   1: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC   182
#>   2: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC     4
#>   3: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC     5
#>   4: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC     9
#>   5: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC    83
#>  ---                                                                           
#> 798: PJ01_POOL01_LTR29LC90_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC     1
#> 799: PJ01_POOL01_LTR29LC90_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC     2
#> 800: PJ01_POOL01_LTR29LC90_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC     8
#> 801:  PJ01_POOL01_LTR53LC22_PT001_PT001-93_lenti_GLOBE_PB_1_SLiM_0030_MNC     1
#> 802:   PJ01_POOL01_LTR9LC80_PT001_PT001-74_lenti_GLOBE_PB_1_SLiM_0180_MNC     1

For details on usage and arguments view the dedicated function documentation.
Integration matrices import can be automated when the association file
is imported with the file system alignment option.
ISAnalytics provides a function, import_parallel_Vispa2Matrices(),
that allows you to do just that in a fast and efficient way.
withr::with_options(list(ISAnalytics.reports = FALSE), {
    matrices <- import_parallel_Vispa2Matrices(af,
        c("seqCount", "fragmentEstimate"),
        mode = "AUTO"
    )
})
Let's see how the behavior of the function changes when we change arguments.
association_file argument

You can supply a data frame object, imported via import_association_file()
(see Section 4.4), or a string (the path to the association file on disk).
In the first scenario the file system alignment step is necessary, since the
function scans the folders contained in the column Path_quant; in the second
case you should also provide an appropriate root as an additional named
argument (through ...), and the function will internally call
import_association_file(). If you don't have specific needs, we recommend
doing the 2 steps separately and providing the association file as a data frame.
quantification_type argument

For each pool there may be multiple available quantification types, that is,
different matrices containing the same samples and the same genomic features
but a different quantification. A typical workflow contemplates seqCount
and fragmentEstimate; all the supported quantification types can be viewed
with quantification_types().
matrix_type argument

As we mentioned in Section 4.3, annotation columns are optional and may not
be included in some matrices. This argument allows you to tell the function
to look for only a specific type of matrix, either annotated or
not_annotated. File suffixes for matrices are specified via
matrix_file_suffixes().
workers argument
Sets the number of parallel workers to use. The optimal value highly depends on the hardware configuration of your machine.
multi_quant_matrix argument
When importing more than one quantification at once, it can be very handy to have all data in a single data frame rather than two. If set to TRUE the function will internally call comparison_matrix() and produce a single data frame that has a dedicated column for each quantification. For example, for the matrices we’ve imported before:
#>    chr integration_locus strand     GeneName GeneStrand
#> 1:  16          68164148      +       NFATC3          +
#> 2:   4         129390130      + LOC100507487          +
#> 3:   5          84009671      -        EDIL3          -
#> 4:  12          54635693      -         CBX5          -
#> 5:   2         181930711      +       UBE2E3          +
#> 6:  20          35920986      +       MANBAL          +
#>                                                 CompleteAmplificationID fragmentEstimate seqCount
#> 1: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC       102.945718      182
#> 2: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC         3.013590        4
#> 3: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC         5.033742        5
#> 4: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC         9.129338        9
#> 5: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC        50.502734       83
#> 6: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC        16.394422       39
report_path argument
As other import functions, import_parallel_Vispa2Matrices() also produces an interactive report; use this argument to set the appropriate path where the report should be saved.
mode argument
This argument can take one of two values, AUTO or INTERACTIVE. The INTERACTIVE workflow, as the name suggests, requires user console input but allows fine tuning of the import process. AUTO, on the other hand, allows a fully automated workflow but of course has some limitations.
What do you want to import?
In fully automated mode, the function will try to import everything that is contained in the input association file. This means that if you need to import only a specific set of projects/pools, you will need to filter the association file accordingly prior to calling the function (you can easily do that via the filter_for argument, as explained in Section 4.4).
In interactive mode the function will ask you to type what you want to import.
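For instance, restricting an automated import to a single project might look like this (the named-list form of filter_for is sketched here as an assumption based on Section 4.4; the paths are placeholders):

```r
library(ISAnalytics)

## Keep only the rows of the association file belonging to project
## "PJ01" before importing
af_pj01 <- import_association_file(
    "path/to/asso_file.tsv",
    root = "path/to/fs_root",
    filter_for = list(ProjectID = "PJ01")
)
matrices <- import_parallel_Vispa2Matrices(
    association_file = af_pj01,
    quantification_type = c("seqCount", "fragmentEstimate"),
    mode = "AUTO"
)
```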
How to deal with duplicates?
When scanning folders for files that match a given pattern (in our case the
function looks for matrices that match the quantification type and the
matrix type), it is very possible that the same folder contains multiple files
for the same quantification. Of course this is not recommended, we suggest to
move the duplicated files in a sub directory or remove them if they’re not
necessary, but in case this happens, in interactive mode, the function asks
directly which files should be considered. Of course this is not possible in
automated mode, therefore you need to set two other arguments (described
in the next sub sections) to “help” the function discriminate
between duplicates. Please note that if such discrimination is not possible
no files are imported.
patterns argument
This argument is relevant only if mode is set to AUTO. Providing a set of patterns (interpreted as regular expressions) helps the function choose between duplicated files if any are found. If you’re confident your folders don’t contain any duplicates, feel free to ignore this argument.
matching_opt argument
This argument is relevant only if mode is set to AUTO and patterns isn’t NULL. It tells the function how to match the given patterns when multiple are supplied: ALL means keep only those files whose names match all the given patterns, ANY means keep only those files whose names match any of the given patterns, and OPTIONAL expresses a preference: try to find files that contain the patterns and, if none are found, return whatever is found.
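Combining the two arguments, an automated import that prefers files whose name contains a given keyword could be sketched as (af is assumed to be a previously imported association file):

```r
library(ISAnalytics)

## Prefer files whose name matches the pattern, but fall back to
## whatever is found if none match (OPTIONAL)
matrices <- import_parallel_Vispa2Matrices(
    association_file = af,
    quantification_type = c("seqCount", "fragmentEstimate"),
    mode = "AUTO",
    patterns = "annotated",
    matching_opt = "OPTIONAL"
)
```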
... argument
Additional named arguments to supply to both import_association_file() and comparison_matrix().
Earlier versions of the package featured two separate functions, import_parallel_Vispa2Matrices_auto() and import_parallel_Vispa2Matrices_interactive(). These functions are now officially deprecated (since ISAnalytics 1.3.3) and will become defunct in the next release cycle.
This section goes more in detail on some data cleaning and pre-processing operations you can perform with this package.
ISAnalytics offers several different functions for cleaning and pre-processing your data.
compute_near_integrations()
outlier_filter()
remove_collisions()
purity_filter()
aggregate_values_by_key(), aggregate_metadata()

In this section we illustrate the functions dedicated to collision removal.
We’re not going into too much detail here, but we’ll explain in a very simple way what a “collision” is and how the functions in this package deal with one. We say that an integration (i.e. a unique combination of mandatory_IS_vars()) is a collision if this combination is shared between different independent samples, where an independent sample is a unique combination of metadata fields specified by the user. The rationale is that it’s highly improbable to observe the very same integration in two different independent samples; this phenomenon might therefore indicate some kind of contamination in the sequencing or PCR phase, and for this reason we might want to exclude it from our analysis.
ISAnalytics provides a function that processes the imported data for the
removal or reassignment of these “problematic” integrations,
remove_collisions().
The processing is done using the sequence count value, so the corresponding matrix is needed for this operation.
The remove_collisions() function follows several logical
steps to decide whether
an integration is a collision and if it is it decides whether to re-assign it or
remove it entirely based on different criteria.
The function uses the information stored in the association file to assess which independent samples are present and counts the number of independent samples for each integration: those with a count > 1 are considered collisions.
| chr | integration_locus | strand | seqCount | CompleteAmplificationID | SubjectID | ProjectID | 
|---|---|---|---|---|---|---|
| 1 | 123454 | + | 653 | SAMPLE1 | SUBJ01 | PJ01 | 
| 1 | 123454 | + | 456 | SAMPLE2 | SUBJ02 | PJ01 | 
Once the collisions are identified, the function follows 3 steps in which it tries to re-assign the combination to a single independent sample. The criteria, evaluated in order, are:

the sample collection date: the integration is assigned to the sample with the earliest date
the number of replicates: the integration is assigned to the sample with the highest number of replicates
the sequence count value: the integration is assigned to the sample whose total sequence count exceeds the other’s by at least a given factor (reads_ratio), the default value is 10

If none of the criteria are sufficient to make a decision, the integration is simply removed from the matrix.
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
## Multi quantification matrix
no_coll <- remove_collisions(x = integration_matrices,
                             association_file = association_file,
                             report_path = NULL)
#> Identifying collisions...
#> Processing collisions...
#> Finished!
## Matrix list
separated <- separate_quant_matrices(integration_matrices)
no_coll_list <- remove_collisions(x = separated,
                             association_file = association_file,
                             report_path = NULL)
#> Identifying collisions...
#> Processing collisions...
#> Finished!
## Only sequence count
no_coll_single <- remove_collisions(x = separated$seqCount,
                             association_file = association_file,
                             quant_cols = c(seqCount = "Value"),
                             report_path = NULL)
#> Identifying collisions...
#> Processing collisions...
#> Finished!
Important notes on the association file:
The function accepts different inputs, namely:

a multi-quantification matrix (recommended)
a named list of matrices, whose names must be quantification types as returned by quantification_types()
the standalone sequence count matrix

If the option ISAnalytics.reports is active, an interactive report in HTML format will be produced at the specified path.
If you’ve given as input the standalone sequence count
matrix to remove_collisions(), to realign other matrices you have
to call the function realign_after_collisions(), passing as input the
processed sequence count matrix and the named list of other matrices
to realign.
NOTE: the names in the list must be quantification types.
other_realigned <- realign_after_collisions(
  sc_matrix = no_coll_single,
  other_matrices = list(fragmentEstimate = separated$fragmentEstimate)
)
In this section we’re going to explain in detail how to use functions of the aggregate family, namely:

aggregate_metadata()
aggregate_values_by_key()

We refer to information contained in the association file as “metadata”:
sometimes it’s useful to obtain collective information based on a certain group of variables we’re interested in. The function aggregate_metadata() does just that: according to the grouping variables, meaning the names of the columns in the association file to perform a group_by operation with, it creates a summary. You can fully customize the summary by providing a “function table” that tells the function which operation should be applied to which column and what name to give to the output column. A default is already supplied:
#> # A tibble: 15 × 4
#>    Column               Function  Args  Output_colname
#>    <chr>                <list>    <lgl> <chr>         
#>  1 FusionPrimerPCRDate  <formula> NA    {.col}_min    
#>  2 LinearPCRDate        <formula> NA    {.col}_min    
#>  3 VCN                  <formula> NA    {.col}_avg    
#>  4 ng DNA corrected     <formula> NA    {.col}_avg    
#>  5 Kapa                 <formula> NA    {.col}_avg    
#>  6 ng DNA corrected     <formula> NA    {.col}_sum    
#>  7 ulForPool            <formula> NA    {.col}_sum    
#>  8 BARCODE_MUX          <formula> NA    {.col}_sum    
#>  9 TRIMMING_FINAL_LTRLC <formula> NA    {.col}_sum    
#> 10 LV_MAPPED            <formula> NA    {.col}_sum    
#> 11 BWA_MAPPED_OVERALL   <formula> NA    {.col}_sum    
#> 12 ISS_MAPPED_PP        <formula> NA    {.col}_sum    
#> 13 PCRMethod            <formula> NA    {.col}        
#> 14 NGSTechnology        <formula> NA    {.col}        
#> 15 DNAnumber            <formula> NA    {.col}
You can either provide purrr-style lambdas (as in the example above), or simply specify the name of the function and additional parameters as a list in a separate column. If you choose to provide your own table you should maintain the column names for the function to work properly. For more details take a look at the function documentation: ?default_meta_agg.
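As a sketch, a custom function table mirrors the columns of the default shown above (Column, Function, Args, Output_colname); the aggregating_functions parameter name used here should be verified against ?aggregate_metadata:

```r
library(ISAnalytics)
library(tibble)

## Custom function table: same column names as default_meta_agg()
custom_agg <- tibble::tribble(
    ~Column,     ~Function,                ~Args, ~Output_colname,
    "VCN",       ~ mean(.x, na.rm = TRUE), NA,    "{.col}_avg",
    "ulForPool", ~ sum(.x, na.rm = TRUE),  NA,    "{.col}_sum"
)

data("association_file", package = "ISAnalytics")
aggregated <- aggregate_metadata(
    association_file = association_file,
    aggregating_functions = custom_agg
)
```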
The association file used in this example was imported via import_association_file(). If you need more information on import functions, please view the vignette “How to use import functions”: vignette("how_to_import_functions", package = "ISAnalytics").
data("association_file", package = "ISAnalytics")
aggregated_meta <- aggregate_metadata(association_file = association_file)
#> # A tibble: 20 × 19
#>    SubjectID CellMarker Tissue TimePoint FusionPrimerPCRDate_min LinearPCRDate_min VCN_avg `ng DNA corrected_…` Kapa_avg
#>    <chr>     <chr>      <chr>  <chr>     <date>                  <date>              <dbl>                <dbl>    <dbl>
#>  1 PT001     MNC        BM     0030      2016-11-03              Inf                  0.26                300.     NaN  
#>  2 PT001     MNC        BM     0060      2016-11-03              Inf                  0.42                171.     NaN  
#>  3 PT001     MNC        BM     0090      2016-11-03              Inf                  0.35                 89.2    NaN  
#>  4 PT001     MNC        BM     0180      2016-11-03              Inf                  0.27                181.     NaN  
#>  5 PT001     MNC        BM     0360      2017-04-21              Inf                  0.18                 42      NaN  
#>  6 PT001     MNC        PB     0030      2016-11-03              Inf                  0.23                 23.8    NaN  
#>  7 PT001     MNC        PB     0060      2016-11-03              Inf                  0.3                  23.2    NaN  
#>  8 PT001     MNC        PB     0090      2016-11-03              Inf                  0.26                 26.6    NaN  
#>  9 PT001     MNC        PB     0180      2016-11-03              Inf                  0.24                 23.1    NaN  
#> 10 PT001     MNC        PB     0360      2017-04-21              Inf                  0.19                 45.2    NaN  
#> 11 PT002     MNC        BM     0030      2017-04-21              Inf                  1.01                182.     NaN  
#> 12 PT002     MNC        BM     0060      2017-05-05              Inf                  1.24                172.     NaN  
#> 13 PT002     MNC        BM     0090      2017-05-05              Inf                  2.05                135.     NaN  
#> 14 PT002     MNC        BM     0180      2017-05-16              Inf                  2.3                 196.     NaN  
#> 15 PT002     MNC        BM     0360      2018-03-12              Inf                  2.52                299.      38.4
#> 16 PT002     MNC        PB     0030      2017-04-21              Inf                  0.77                106.     NaN  
#> 17 PT002     MNC        PB     0060      2017-05-05              Inf                  1.07                 77.4    NaN  
#> 18 PT002     MNC        PB     0090      2017-05-05              Inf                  1.09                123      NaN  
#> 19 PT002     MNC        PB     0180      2017-05-05              Inf                  1.43                123.     NaN  
#> 20 PT002     MNC        PB     0360      2018-03-12              Inf                  2.07                300.      22.0
#> # … with 10 more variables: `ng DNA corrected_sum` <dbl>, ulForPool_sum <dbl>, BARCODE_MUX_sum <int>,
#> #   TRIMMING_FINAL_LTRLC_sum <int>, LV_MAPPED_sum <int>, BWA_MAPPED_OVERALL_sum <int>, ISS_MAPPED_PP_sum <int>,
#> #   PCRMethod <chr>, NGSTechnology <chr>, DNAnumber <chr>
ISAnalytics contains useful functions to aggregate the values contained in your imported matrices based on a key, i.e. a single column or a combination of columns contained in the association file that are related to the samples.
The matrices used in this example were imported via import_parallel_Vispa2Matrices().
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
  x = integration_matrices, 
  association_file = association_file,
  value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 1,074 × 11
#>    chr   integration_locus strand GeneName GeneStrand SubjectID CellMarker Tissue TimePoint seqCount_sum
#>    <chr>             <dbl> <chr>  <chr>    <chr>      <chr>     <chr>      <chr>  <chr>            <dbl>
#>  1 1               8464757 -      RERE     -          PT001     MNC        BM     0030               542
#>  2 1               8464757 -      RERE     -          PT001     MNC        BM     0060                 1
#>  3 1               8607357 +      RERE     -          PT001     MNC        BM     0060                 1
#>  4 1               8607357 +      RERE     -          PT001     MNC        BM     0180              1096
#>  5 1               8607357 +      RERE     -          PT001     MNC        BM     0360               330
#>  6 1               8607362 -      RERE     -          PT001     MNC        BM     0180              1702
#>  7 1               8850362 +      RERE     -          PT002     MNC        BM     0360               562
#>  8 1              11339120 +      UBIAD1   +          PT001     MNC        BM     0060              1605
#>  9 1              11339120 +      UBIAD1   +          PT001     MNC        PB     0060                 1
#> 10 1              11339120 +      UBIAD1   +          PT001     MNC        PB     0180                 1
#>    fragmentEstimate_sum
#>                   <dbl>
#>  1                 3.01
#>  2                 1.00
#>  3                 1.00
#>  4                 5.01
#>  5                34.1 
#>  6                 4.01
#>  7                 3.01
#>  8                 8.03
#>  9                 1.00
#> 10                 1.00
#> # … with 1,064 more rows
The function aggregate_values_by_key() can perform the aggregation both on a list of matrices and on a single matrix.
The function has several different parameters that have default values that can be changed according to user preference.
key value
The default key is c("SubjectID", "CellMarker", "Tissue", "TimePoint") (the same default key as the aggregate_metadata() function).
agg1 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = c("SubjectID", "ProjectID"),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 9
#>    chr   integration_locus strand GeneName GeneStrand SubjectID ProjectID seqCount_sum fragmentEstimate_sum
#>    <chr>             <dbl> <chr>  <chr>    <chr>      <chr>     <chr>            <dbl>                <dbl>
#>  1 1               8464757 -      RERE     -          PT001     PJ01               543                 4.01
#>  2 1               8607357 +      RERE     -          PT001     PJ01              1427                40.1 
#>  3 1               8607362 -      RERE     -          PT001     PJ01              1702                 4.01
#>  4 1               8850362 +      RERE     -          PT002     PJ01               562                 3.01
#>  5 1              11339120 +      UBIAD1   +          PT001     PJ01              1607                10.0 
#>  6 1              12341466 -      VPS13D   +          PT002     PJ01              1843                 8.05
#>  7 1              14034054 -      PRDM2    +          PT002     PJ01              1938                 3.01
#>  8 1              16186297 -      SPEN     +          PT001     PJ01              3494                16.1 
#>  9 1              16602483 +      FBXO42   -          PT001     PJ01              2947                 9.04
#> 10 1              16602483 +      FBXO42   -          PT002     PJ01                30                 2.00
#> # … with 567 more rows
lambda value
The lambda parameter indicates the function(s) to be applied to the values for aggregation. lambda must be a named list of either functions or purrr-style lambdas: if you would like to specify additional parameters to the function, the second option is recommended.
The only important note on functions is that they should perform some kind of
aggregation on numeric values: this means in practical terms they need
to accept a vector of numeric/integer values as input and produce a
SINGLE value as output. Valid options for this purpose might be: sum, mean,
median, min, max and so on.
agg2 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(mean = ~ mean(.x, na.rm = TRUE)),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 8
#>    chr   integration_locus strand GeneName GeneStrand SubjectID seqCount_mean fragmentEstimate_mean
#>    <chr>             <dbl> <chr>  <chr>    <chr>      <chr>             <dbl>                 <dbl>
#>  1 1               8464757 -      RERE     -          PT001              272.                  2.01
#>  2 1               8607357 +      RERE     -          PT001              285.                  8.02
#>  3 1               8607362 -      RERE     -          PT001              851                   2.01
#>  4 1               8850362 +      RERE     -          PT002              562                   3.01
#>  5 1              11339120 +      UBIAD1   +          PT001              321.                  2.01
#>  6 1              12341466 -      VPS13D   +          PT002             1843                   8.05
#>  7 1              14034054 -      PRDM2    +          PT002             1938                   3.01
#>  8 1              16186297 -      SPEN     +          PT001              699.                  3.22
#>  9 1              16602483 +      FBXO42   -          PT001              982.                  3.01
#> 10 1              16602483 +      FBXO42   -          PT002               30                   2.00
#> # … with 567 more rows
Note that, when specifying purrr-style lambdas (formulas), the first parameter needs to be set to .x; other parameters can be set as usual.
You can also supply, in lambda, functions that produce data frames or lists. In this case all variables from the produced data frame will be included in the final data frame. For example:
agg3 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(describe = ~ list(psych::describe(.x))),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 8
#>    chr   integration_locus strand GeneName GeneStrand SubjectID seqCount_describe fragmentEstimate_describe
#>    <chr>             <dbl> <chr>  <chr>    <chr>      <chr>     <list>            <list>                   
#>  1 1               8464757 -      RERE     -          PT001     <psych [1 × 13]>  <psych [1 × 13]>         
#>  2 1               8607357 +      RERE     -          PT001     <psych [1 × 13]>  <psych [1 × 13]>         
#>  3 1               8607362 -      RERE     -          PT001     <psych [1 × 13]>  <psych [1 × 13]>         
#>  4 1               8850362 +      RERE     -          PT002     <psych [1 × 13]>  <psych [1 × 13]>         
#>  5 1              11339120 +      UBIAD1   +          PT001     <psych [1 × 13]>  <psych [1 × 13]>         
#>  6 1              12341466 -      VPS13D   +          PT002     <psych [1 × 13]>  <psych [1 × 13]>         
#>  7 1              14034054 -      PRDM2    +          PT002     <psych [1 × 13]>  <psych [1 × 13]>         
#>  8 1              16186297 -      SPEN     +          PT001     <psych [1 × 13]>  <psych [1 × 13]>         
#>  9 1              16602483 +      FBXO42   -          PT001     <psych [1 × 13]>  <psych [1 × 13]>         
#> 10 1              16602483 +      FBXO42   -          PT002     <psych [1 × 13]>  <psych [1 × 13]>         
#> # … with 567 more rows
value_cols value
The value_cols parameter tells the function which numeric columns of x the functions should be applied to. Note that every function contained in lambda will be applied to every column in value_cols: resulting columns will be named as “original name_function applied”.
agg4 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    value_cols = c("seqCount", "fragmentEstimate")
)
#> # A tibble: 577 × 10
#>    chr   integration_locus strand GeneName GeneStrand SubjectID seqCount_sum seqCount_mean fragmentEstimate_sum
#>    <chr>             <dbl> <chr>  <chr>    <chr>      <chr>            <dbl>         <dbl>                <dbl>
#>  1 1               8464757 -      RERE     -          PT001              543          272.                 4.01
#>  2 1               8607357 +      RERE     -          PT001             1427          285.                40.1 
#>  3 1               8607362 -      RERE     -          PT001             1702          851                  4.01
#>  4 1               8850362 +      RERE     -          PT002              562          562                  3.01
#>  5 1              11339120 +      UBIAD1   +          PT001             1607          321.                10.0 
#>  6 1              12341466 -      VPS13D   +          PT002             1843         1843                  8.05
#>  7 1              14034054 -      PRDM2    +          PT002             1938         1938                  3.01
#>  8 1              16186297 -      SPEN     +          PT001             3494          699.                16.1 
#>  9 1              16602483 +      FBXO42   -          PT001             2947          982.                 9.04
#> 10 1              16602483 +      FBXO42   -          PT002               30           30                  2.00
#>    fragmentEstimate_mean
#>                    <dbl>
#>  1                  2.01
#>  2                  8.02
#>  3                  2.01
#>  4                  3.01
#>  5                  2.01
#>  6                  8.05
#>  7                  3.01
#>  8                  3.22
#>  9                  3.01
#> 10                  2.00
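The naming scheme above can be reproduced with a minimal base-R sketch; the toy data frame and variable names below are illustrative only, not the vignette's objects or the package's internals:

```r
# Base-R sketch of the naming scheme: every function in `lambda` is
# applied to every column in `value_cols`, producing columns named
# "<original name>_<function name>" (toy data for illustration).
df <- data.frame(
    SubjectID        = c("PT001", "PT001", "PT002"),
    seqCount         = c(543, 1427, 562),
    fragmentEstimate = c(4.01, 40.10, 3.01)
)
lambda     <- list(sum = sum, mean = mean)
value_cols <- c("seqCount", "fragmentEstimate")

out <- data.frame(SubjectID = unique(df$SubjectID))
for (col in value_cols) {
    for (fn in names(lambda)) {
        agg <- aggregate(df[[col]], by = list(SubjectID = df$SubjectID),
                         FUN = lambda[[fn]])
        # column name follows the "<column>_<function>" convention
        out[[paste(col, fn, sep = "_")]] <-
            agg$x[match(out$SubjectID, agg$SubjectID)]
    }
}
names(out)
# "SubjectID", "seqCount_sum", "seqCount_mean",
# "fragmentEstimate_sum", "fragmentEstimate_mean"
```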
#> # … with 567 more rows

The group value

The group parameter should contain all the other variables to include
in the grouping besides key. By default this contains
c("chr", "integration_locus", "strand", "GeneName", "GeneStrand").
You can change this grouping as you see fit; if you don't want to add
any other variable to the key, just set it to NULL.

agg5 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    group = c(mandatory_IS_vars()),
    value_cols = c("seqCount", "fragmentEstimate")
)

#> # A tibble: 577 × 8
#>    chr   integration_locus strand SubjectID seqCount_sum seqCount_mean fragmentEstimate_sum fragmentEstimate_mean
#>    <chr>             <dbl> <chr>  <chr>            <dbl>         <dbl>                <dbl>                 <dbl>
#>  1 1               8464757 -      PT001              543          272.                 4.01                  2.01
#>  2 1               8607357 +      PT001             1427          285.                40.1                   8.02
#>  3 1               8607362 -      PT001             1702          851                  4.01                  2.01
#>  4 1               8850362 +      PT002              562          562                  3.01                  3.01
#>  5 1              11339120 +      PT001             1607          321.                10.0                   2.01
#>  6 1              12341466 -      PT002             1843         1843                  8.05                  8.05
#>  7 1              14034054 -      PT002             1938         1938                  3.01                  3.01
#>  8 1              16186297 -      PT001             3494          699.                16.1                   3.22
#>  9 1              16602483 +      PT001             2947          982.                 9.04                  3.01
#> 10 1              16602483 +      PT002               30           30                  2.00                  2.00
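Setting group = NULL means the grouping reduces to the key alone, so the result has one row per key value. A minimal base-R analogue of that behavior, on toy data rather than the vignette's objects:

```r
# Base-R sketch of what grouping by the key alone implies: with no
# extra grouping variables, all rows for a SubjectID collapse into one
# (toy data for illustration).
df <- data.frame(
    SubjectID = c("PT001", "PT001", "PT002"),
    seqCount  = c(543, 1427, 562)
)
aggregate(seqCount ~ SubjectID, data = df, FUN = sum)
# PT001 -> 1970, PT002 -> 562
```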
#> # … with 567 more rows

R session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       Ubuntu 20.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-05-24
#>  pandoc   2.5 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [2] CRAN (R 4.2.0)
#>  BiocManager    1.30.18 2022-05-18 [2] CRAN (R 4.2.0)
#>  BiocParallel   1.30.2  2022-05-24 [2] Bioconductor
#>  BiocStyle    * 2.24.0  2022-05-24 [2] Bioconductor
#>  bit            4.0.4   2020-08-04 [2] CRAN (R 4.2.0)
#>  bit64          4.0.5   2020-08-30 [2] CRAN (R 4.2.0)
#>  bookdown       0.26    2022-04-15 [2] CRAN (R 4.2.0)
#>  bslib          0.3.1   2021-10-06 [2] CRAN (R 4.2.0)
#>  cellranger     1.1.0   2016-07-27 [2] CRAN (R 4.2.0)
#>  cli            3.3.0   2022-04-25 [2] CRAN (R 4.2.0)
#>  colorspace     2.0-3   2022-02-21 [2] CRAN (R 4.2.0)
#>  crayon         1.5.1   2022-03-26 [2] CRAN (R 4.2.0)
#>  crosstalk      1.2.0   2021-11-04 [2] CRAN (R 4.2.0)
#>  curl           4.3.2   2021-06-23 [2] CRAN (R 4.2.0)
#>  data.table     1.14.2  2021-09-27 [2] CRAN (R 4.2.0)
#>  datamods       1.3.2   2022-05-06 [2] CRAN (R 4.2.0)
#>  DBI            1.1.2   2021-12-20 [2] CRAN (R 4.2.0)
#>  digest         0.6.29  2021-12-01 [2] CRAN (R 4.2.0)
#>  dplyr          1.0.9   2022-04-28 [2] CRAN (R 4.2.0)
#>  DT           * 0.23    2022-05-10 [2] CRAN (R 4.2.0)
#>  ellipsis       0.3.2   2021-04-29 [2] CRAN (R 4.2.0)
#>  eulerr         6.1.1   2021-09-06 [2] CRAN (R 4.2.0)
#>  evaluate       0.15    2022-02-18 [2] CRAN (R 4.2.0)
#>  fansi          1.0.3   2022-03-24 [2] CRAN (R 4.2.0)
#>  farver         2.1.0   2021-02-28 [2] CRAN (R 4.2.0)
#>  fastmap        1.1.0   2021-01-25 [2] CRAN (R 4.2.0)
#>  forcats        0.5.1   2021-01-27 [2] CRAN (R 4.2.0)
#>  foreign        0.8-82  2022-01-16 [2] CRAN (R 4.2.0)
#>  fs             1.5.2   2021-12-08 [2] CRAN (R 4.2.0)
#>  generics       0.1.2   2022-01-31 [2] CRAN (R 4.2.0)
#>  ggplot2        3.3.6   2022-05-03 [2] CRAN (R 4.2.0)
#>  glue           1.6.2   2022-02-24 [2] CRAN (R 4.2.0)
#>  gtable         0.3.0   2019-03-25 [2] CRAN (R 4.2.0)
#>  gtools         3.9.2.1 2022-05-23 [2] CRAN (R 4.2.0)
#>  haven          2.5.0   2022-04-15 [2] CRAN (R 4.2.0)
#>  highr          0.9     2021-04-16 [2] CRAN (R 4.2.0)
#>  hms            1.1.1   2021-09-26 [2] CRAN (R 4.2.0)
#>  htmltools      0.5.2   2021-08-25 [2] CRAN (R 4.2.0)
#>  htmlwidgets    1.5.4   2021-09-08 [2] CRAN (R 4.2.0)
#>  httpuv         1.6.5   2022-01-05 [2] CRAN (R 4.2.0)
#>  httr           1.4.3   2022-05-04 [2] CRAN (R 4.2.0)
#>  ISAnalytics  * 1.6.2   2022-05-24 [1] Bioconductor
#>  jquerylib      0.1.4   2021-04-26 [2] CRAN (R 4.2.0)
#>  jsonlite       1.8.0   2022-02-22 [2] CRAN (R 4.2.0)
#>  knitr          1.39    2022-04-26 [2] CRAN (R 4.2.0)
#>  later          1.3.0   2021-08-18 [2] CRAN (R 4.2.0)
#>  lattice        0.20-45 2021-09-22 [2] CRAN (R 4.2.0)
#>  lifecycle      1.0.1   2021-09-24 [2] CRAN (R 4.2.0)
#>  lubridate      1.8.0   2021-10-07 [2] CRAN (R 4.2.0)
#>  magrittr     * 2.0.3   2022-03-30 [2] CRAN (R 4.2.0)
#>  mime           0.12    2021-09-28 [2] CRAN (R 4.2.0)
#>  mnormt         2.0.2   2020-09-01 [2] CRAN (R 4.2.0)
#>  munsell        0.5.0   2018-06-12 [2] CRAN (R 4.2.0)
#>  nlme           3.1-157 2022-03-25 [2] CRAN (R 4.2.0)
#>  openxlsx       4.2.5   2021-12-14 [2] CRAN (R 4.2.0)
#>  pillar         1.7.0   2022-02-01 [2] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3   2019-09-22 [2] CRAN (R 4.2.0)
#>  plyr           1.8.7   2022-03-24 [2] CRAN (R 4.2.0)
#>  polyclip       1.10-0  2019-03-14 [2] CRAN (R 4.2.0)
#>  polylabelr     0.2.0   2020-04-19 [2] CRAN (R 4.2.0)
#>  promises       1.2.0.1 2021-02-11 [2] CRAN (R 4.2.0)
#>  psych          2.2.5   2022-05-10 [2] CRAN (R 4.2.0)
#>  purrr          0.3.4   2020-04-17 [2] CRAN (R 4.2.0)
#>  R.methodsS3    1.8.1   2020-08-26 [2] CRAN (R 4.2.0)
#>  R.oo           1.24.0  2020-08-26 [2] CRAN (R 4.2.0)
#>  R.utils        2.11.0  2021-09-26 [2] CRAN (R 4.2.0)
#>  R6             2.5.1   2021-08-19 [2] CRAN (R 4.2.0)
#>  Rcapture       1.4-4   2022-05-04 [2] CRAN (R 4.2.0)
#>  Rcpp           1.0.8.3 2022-03-17 [2] CRAN (R 4.2.0)
#>  readr          2.1.2   2022-01-30 [2] CRAN (R 4.2.0)
#>  readxl         1.4.0   2022-03-28 [2] CRAN (R 4.2.0)
#>  RefManageR   * 1.3.0   2020-11-13 [2] CRAN (R 4.2.0)
#>  rio            0.5.29  2021-11-22 [2] CRAN (R 4.2.0)
#>  rlang          1.0.2   2022-03-04 [2] CRAN (R 4.2.0)
#>  rmarkdown      2.14    2022-04-25 [2] CRAN (R 4.2.0)
#>  sass           0.4.1   2022-03-23 [2] CRAN (R 4.2.0)
#>  scales         1.2.0   2022-04-13 [2] CRAN (R 4.2.0)
#>  sessioninfo  * 1.2.2   2021-12-06 [2] CRAN (R 4.2.0)
#>  shiny          1.7.1   2021-10-02 [2] CRAN (R 4.2.0)
#>  shinyWidgets   0.7.0   2022-05-11 [2] CRAN (R 4.2.0)
#>  stringi        1.7.6   2021-11-29 [2] CRAN (R 4.2.0)
#>  stringr        1.4.0   2019-02-10 [2] CRAN (R 4.2.0)
#>  tibble         3.1.7   2022-05-03 [2] CRAN (R 4.2.0)
#>  tidyr          1.2.0   2022-02-01 [2] CRAN (R 4.2.0)
#>  tidyselect     1.1.2   2022-02-21 [2] CRAN (R 4.2.0)
#>  tmvnsim        1.0-2   2016-12-15 [2] CRAN (R 4.2.0)
#>  tzdb           0.3.0   2022-03-28 [2] CRAN (R 4.2.0)
#>  utf8           1.2.2   2021-07-24 [2] CRAN (R 4.2.0)
#>  vctrs          0.4.1   2022-04-13 [2] CRAN (R 4.2.0)
#>  vroom          1.5.7   2021-11-30 [2] CRAN (R 4.2.0)
#>  withr          2.5.0   2022-03-03 [2] CRAN (R 4.2.0)
#>  xfun           0.31    2022-05-10 [2] CRAN (R 4.2.0)
#>  xml2           1.3.3   2021-11-30 [2] CRAN (R 4.2.0)
#>  xtable         1.8-4   2019-04-21 [2] CRAN (R 4.2.0)
#>  yaml           2.3.5   2022-02-21 [2] CRAN (R 4.2.0)
#>  zip            2.2.0   2021-05-31 [2] CRAN (R 4.2.0)
#> 
#>  [1] /tmp/Rtmp2hFq31/Rinst21ec1c75e42839
#>  [2] /home/biocbuild/bbs-3.15-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

This vignette was generated using BiocStyle (Oleś, 2022) with knitr (Xie, 2022) and rmarkdown (Allaire, Xie, McPherson, et al., 2022) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.14. 2022. URL: https://github.com/rstudio/rmarkdown.
[2] G. Spinozzi, A. Calabria, S. Brasca, et al. “VISPA2: a scalable pipeline for high-throughput identification and annotation of vector integration sites”. In: BMC Bioinformatics (Nov. 25, 2017). DOI: 10.1186/s12859-017-1937-9.
[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.24.0. 2022. URL: https://github.com/Bioconductor/BiocStyle.
[5] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.39. 2022. URL: https://yihui.org/knitr/.