Packages used for the upcoming Data Science program (Omics2020)

https://sciencecoach.t-bio.info/schools/omics2020-data-science/

Recommended activity/textbook on biostatistics to complete before the course:

https://seeing-theory.brown.edu

R

It is common for today’s scientific and business industries to collect large amounts of data, and the ability to analyze the data and learn from it is critical to making informed decisions. Familiarity with software such as R allows users to visualize data, run statistical tests, and apply machine learning algorithms. Even if you already know other software, there are still good reasons to learn R: 

  1. R is free. If your future employer does not already have R installed, you can always download it for free, unlike other proprietary software packages that require expensive licenses. No matter where you travel, you can have access to R on your computer. 
  2. R gives you access to cutting-edge technology. Top researchers develop statistical learning methods in R, and new algorithms are constantly added to the list of packages you can download. 
  3. R is a useful skill. Employers that value analytics recognize R as useful and important. If for no other reason, learning R is worthwhile to help boost your resume. 

Note that R is a programming language, and there is no intuitive graphical user interface with buttons you can click to run different methods. However, with some practice, this kind of environment makes it easy to quickly code scripts and functions for various statistical purposes. 

Extracted from these useful MIT course notes on R: https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec02.pdf

Video on installing and using the RStudio interface: https://youtu.be/hW7WD9DyDXs

Recommended full book on R for data science: https://r4ds.had.co.nz

 

Package name | Description | Link
tidyverse | The ‘tidyverse’ is a collection of packages that we will discuss later, covering data cleaning, preprocessing, data analysis and visualization. It is intended as an ‘all-in-one’ solution for data science in R. | https://www.tidyverse.org
tidyr | Part of the ‘tidyverse’. Easily ‘tidy’ your data by restructuring columns, reshaping an existing data set into a smaller or larger number of columns. | https://www.rdocumentation.org/packages/tidyr/versions/0.8.3
dplyr | Another ‘tidyverse’ package. A data manipulation package with functions for filtering, arranging and summarizing data. | https://www.rdocumentation.org/packages/dplyr/versions/0.7.8
stats v3.6.2 | Contains functions for statistical calculations and random number generation. It covers most standard statistical operations in R. | https://www.rdocumentation.org/packages/stats/versions/3.6.2
DataExplorer | Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis and predictive modeling. During this process, analysts take a first look at the data, generate relevant hypotheses and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights. | http://boxuancui.github.io/DataExplorer/
histogram | Construct regular and irregular histograms with different options for bin widths. (Histograms are plots that present the distribution of data.) | https://cran.r-project.org/web/packages/histogram/histogram.pdf
corrplot | Displays correlations between objects/variables in a unified manner as a correlation matrix, with different options for the colors and shapes of its elements. | https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
cluster | Contains easy-to-use code both for ‘clustering’ similar objects in the data and for plotting the results in 2D. | https://cran.r-project.org/web/packages/cluster/cluster.pdf
randomForest | A supervised learning package for creating predictive classification models from given data. It is based on the random forest method, which combines a multitude of decision trees into a single model. | https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
tree | A package for building decision trees, which predict an outcome through branching decisions based on cut-off values or categories, for example for genes or their transcripts (RNA) in the data. | https://cran.r-project.org/web/packages/tree/tree.pdf
factoextra | An intuitive package for easy visualization of clustering analysis. Works with PCA, k-means and hierarchical clustering. | https://cran.r-project.org/web/packages/factoextra/index.html
scatterplot3d | Visualize data as points in 3D space to get a sense of the distances between objects. | https://cran.r-project.org/web/packages/scatterplot3d/index.html
ggplot2 | One of the most famous data visualization packages. Turns most data analysis results into intuitive plots for visual interpretation. Part of the ‘tidyverse’ group of packages. | https://cran.r-project.org/web/packages/ggplot2/index.html
ggpubr | Adjust the visual and statistical aspects of plots in R. Add p-values and significance levels to box plots, bar plots, line plots and other plots. | http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/
ggfortify | A plotting interface that draws analysis results in a unified style using ‘ggplot2’. Works with PCA and clustering. | https://cran.r-project.org/web/packages/ggfortify/index.html
plotly | A package for creating both publication-ready and interactive plots in R. | https://plotly.com/r/getting-started/
ellipse | A graphical display of statistical quantities such as confidence regions, drawn as ellipses on various plots. | https://cran.r-project.org/web/packages/ellipse/index.html
metafor v2.1-0 | Used mainly for meta-analysis (aggregation of results from multiple studies). Contains the leave1out function, a leave-one-out check used to validate the results. | http://www.metafor-project.org/doku.php
EnhancedVolcano | A publication-grade visualization package for interpreting results from hypothesis-test-based bioinformatics methods such as differential gene expression. Because the plotted results for differentially expressed genes resemble a volcano, these plots are called volcano plots. | http://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html
DESeq2 (Bioconductor) | A package mainly used for differential expression analysis. It takes a hypothesis-testing approach: gene expression is compared from the level of the whole genome down to the level of single genes, applying a statistical test to each gene while adjusting for factors such as sequencing depth and the number of genes analysed. As you can see, this entry carries a Bioconductor mark in parentheses. Bioconductor is a platform that provides tools for the analysis and comprehension of high-throughput data. It is based on the R statistical programming language and is open source, meaning everyone can use it for their science. | https://bioconductor.org/packages/release/bioc/html/DESeq2.html
edgeR (Bioconductor) | Another differential expression method, similar to DESeq2 but considered more robust at finding the most relevant statistically significant differences in gene expression in a dataset. | https://www.bioconductor.org/packages/release/bioc/html/edgeR.html
org.Hs.eg.db (Bioconductor) | Package used to annotate differentially expressed human genes. Use gene names, chromosome locations and other annotations, such as the functions of genes and their products, to add biological relevance to results. | http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
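
The DESeq2 entry above mentions adjusting for the number of genes analysed. One common way to do this is the Benjamini-Hochberg adjustment of p-values, which base R offers through p.adjust(p, method = "BH") in the stats package. The short Python sketch below only illustrates that idea. It is not the code DESeq2 runs, and the function name benjamini_hochberg and the gene p-values are made up for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value
    # is never larger than the adjusted value of the next rank up.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values (made-up numbers, not real data).
gene_p_values = {"geneA": 0.001, "geneB": 0.02, "geneC": 0.03, "geneD": 0.5}
adjusted = benjamini_hochberg(list(gene_p_values.values()))
for gene, p_adj in zip(gene_p_values, adjusted):
    print(gene, round(p_adj, 4))
```

The more genes are tested, the larger the adjustment, so a gene needs stronger evidence to be called differentially expressed.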

Example tutorial in R:

https://edu.t-bio.info/course/pca-data-visualization-r/

Python 

Python is an excellent choice for beginners starting out in machine learning and data science. It is a minimalistic and intuitive language with a full-featured set of libraries (also called frameworks), which significantly reduces the time required to get your first results. One such library is scikit-learn, which provides simple and efficient tools for predictive data analysis, making advanced methods of analysis and visualization accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib.

Quick guide to getting started:

https://scikit-learn.org/stable/getting_started.html

 

Package name | Description | Link
numpy | A package for numerical computing and basic data wrangling in Python. Used to create and modify a variety of arrays, which are the basic data structure of numpy. Indexing, converting between data types and similar operations are among the functions this package provides. | https://numpy.org/doc/stable/
pandas | Pandas is a data analysis and data wrangling toolkit for Python. | https://pandas.pydata.org/ and https://pandas.pydata.org/docs/pandas.pdf
scipy | User-friendly numerical functions for linear algebra, statistics and many other mathematical tasks used in science, biology and engineering. | https://www.scipy.org/
plotly | The same package discussed for R above, for publication-ready plots. An intuitive interface for statistical plots in the Python environment. | https://plotly.com/python/getting-started/
matplotlib | One of the most popular Python visualization libraries. Visualize statistical, clustering and machine learning concepts in publication-grade plots. | https://matplotlib.org/
statistics | The statistics module of the Python standard library. Includes functions to easily find means, medians, modes, variances and standard deviations of numerical data. Confidence intervals and statistical comparisons are better handled with scipy. Numerical data wrangling, such as extracting the part of the data we want based on numerical cut-off values, rounding and transforming it, is done with built-in Python functions or with numpy and pandas. | https://docs.python.org/3/library/statistics.html
biopython | Biopython is a bioinformatics toolkit for the Python environment. Importing data from NCBI databases through Entrez and working with sequences and annotations are among the tasks we can perform with Biopython. | https://biopython.org/wiki/Documentation
seaborn | A data visualization library based on matplotlib. It provides an interface for drawing attractive and informative statistical graphics. | https://seaborn.pydata.org/
AGEpy | Package used for computational biology, mainly to retrieve annotations for bioinformatics data and give biological meaning to statistics and informatics. | https://pypi.org/project/AGEpy/
scikit-learn 0.23.1 | Scikit-learn is one of the most complete packages for machine learning in Python. It provides methods for classification, regression, clustering, dimensionality reduction, model selection and preprocessing, including simple neural networks. It does not cover reinforcement learning, and deep learning is usually done with other libraries. | https://scikit-learn.org/stable/
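
As a small illustration of the statistics entry above, the sketch below computes a few summary values with the standard-library statistics module and then applies a numerical cut-off with plain Python. The expression values and the cut-off are made-up numbers for this example, not course data.

```python
import statistics

# Hypothetical expression values for one gene across six samples (made-up numbers).
expression = [2.5, 3.1, 2.8, 3.6, 2.8, 4.0]

mean_value = statistics.mean(expression)      # arithmetic mean
median_value = statistics.median(expression)  # middle value of the sorted data
mode_value = statistics.mode(expression)      # most frequent value
spread = statistics.stdev(expression)         # sample standard deviation

# Keep only the values above a numerical cut-off, rounded to one decimal place.
cutoff = 3.0
above_cutoff = [round(x, 1) for x in expression if x > cutoff]

print("mean:", round(mean_value, 3))
print("median:", median_value)
print("mode:", mode_value)
print("standard deviation:", round(spread, 3))
print("values above cut-off:", above_cutoff)
```

For larger tables of data, the same steps are usually written with numpy or pandas instead of plain lists.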
