Life in the FASTQ Lane

Bioinformatic Analysis: The Journey of a FASTQ File

March 25, 2022

Table of Contents:

Introduction
The Sequencer ‐ Where it Begins
Quality Control ‐ Check Your Work!
Peaks and Valleys
The Final Analysis

Introduction

While many “bench scientists” are familiar with the workflows of ChIP-Seq and ATAC-Seq, and even with the preparation and analysis of the libraries, the steps between sequencing and fully analyzed data are sometimes thought of as a mystery known only to bioinformatics experts. Most of us have some understanding that the raw data usually arrives in a file format called FASTQ. But how do we get from FASTQ files to peaks on a genome browser? This article provides a peek behind the curtain of the informatic analysis we perform at Active Motif as part of our end-to-end epigenetic services.

The Sequencer ‐ Where it Begins

To begin, the flow cell of the Illumina sequencer is loaded with multiplexed libraries, and the wet lab portion of our ChIP-Seq experiment is complete. The sequencer writes its output as BCL (binary base call) files, which together contain data from all the multiplexed libraries on the flow cell. At this point, every individual library has been sequenced, but the reads corresponding to each are all mixed together. Leveraging the barcoding scheme used to multiplex the libraries, we can now perform what is called demultiplexing, which assigns each read to the library it came from. The first informatic step is to complete the Illumina Sample Sheet. This is a CSV (comma-separated values) file, editable in Excel, that tells the software 1) which library has which barcode sequence, 2) whether the run was single-end or paired-end, and 3) how many cycles were run (also called the read length). With the sample sheet in place, the Illumina software bcl2fastq performs the demultiplexing and generates individual data files, called FASTQ files, for each of the libraries. For customers who choose to perform their own analysis, the FASTQ files are returned as the finished product. For those who take advantage of our end-to-end service, additional analysis is provided as described below.
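
As a rough illustration of what demultiplexing does conceptually, the Python sketch below bins reads by their barcode using a barcode-to-sample mapping like the one a sample sheet provides. It is not how bcl2fastq is implemented. The sample names, barcode sequences, and the `demultiplex` helper are made up for illustration, and real tools typically also tolerate a limited number of barcode mismatches.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample mapping, standing in for a sample sheet.
SAMPLE_SHEET = {
    "ACGTACGT": "sample_A",
    "TGCATGCA": "sample_B",
}

def demultiplex(reads, sample_sheet):
    """Group (barcode, sequence) pairs by sample.

    Reads whose barcode is not in the sample sheet are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = sample_sheet.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

# Toy reads: each is (barcode read, insert read).
reads = [
    ("ACGTACGT", "GATTACA"),
    ("TGCATGCA", "CCGGTTAA"),
    ("ACGTACGT", "TTGACCA"),
    ("NNNNNNNN", "AAAAAAA"),
]

demuxed = demultiplex(reads, SAMPLE_SHEET)
for sample, sequences in sorted(demuxed.items()):
    print(sample, len(sequences))
```

Each key in `demuxed` corresponds to what would become one FASTQ file per library.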

Quality Control ‐ Check Your Work!

The next step is to perform a quality control assessment of the project’s FASTQ files. Using software called FastQC, we can assess metrics such as base quality distribution, adapter contamination, k-mer content, and sequence duplication levels. We additionally perform a multi-species comparison of the libraries, using the Babraham Bioinformatics FastQ Screen software, to confirm that no reads from species other than the organism of interest are contaminating the sample’s data. These steps are a critical part of the workflow to ensure that the final data is of high quality and suitable for downstream analysis.
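
To make “base quality” a little more concrete, here is a minimal Python sketch that decodes the quality line of a single FASTQ record into Phred scores and averages them. It assumes the Phred+33 encoding used by current Illumina FASTQ files. The record and the helper names are invented for illustration, and FastQC itself reports far more than this.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return [ord(character) - offset for character in quality_line]

def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)

def mean_quality(quality_line):
    """Average Phred score across one read."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)

# One made-up FASTQ record: header, sequence, separator, quality string.
record = ["@read_1", "GATTACA", "+", "IIIIHH#"]

scores = phred_scores(record[3])
print(scores)                              # [40, 40, 40, 40, 39, 39, 2]
print(round(mean_quality(record[3]), 1))   # 34.3
print(error_probability(40))               # 0.0001
```

A score of 40 corresponds to a 1 in 10,000 chance of a wrong base call, while the trailing `#` (score 2) marks a very low confidence call.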

Following QC, we align the raw reads in the FASTQ file to an annotated genome. We have the sequence reads, but do not yet know which regions they correspond to on the genome of the organism from which the chromatin was isolated. The genomes of humans, mice, rats, zebrafish, fruit flies, yeast, and the worm C. elegans (among others) have all been sufficiently annotated to align FASTQ files against them. We use software called BWA (Burrows-Wheeler Aligner) to map sequencing reads to the annotated genome (Li and Durbin, 2009). This process produces BAM files from the raw FASTQ files. BAM files contain the read sequences, but now also carry the coordinates of the specific genomic position to which each read maps.
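
BAM is the compressed binary form of the text-based SAM format, so one way to see the coordinate information an alignment adds is to parse a single SAM line. The Python sketch below reads the first six of the mandatory SAM fields. The example line and the helper names are made up for illustration, and in practice one would more likely read BAM files through a dedicated library such as pysam than parse text by hand.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    read_name: str
    flag: int      # bitwise flags describing the alignment
    chrom: str     # reference sequence (chromosome) name
    position: int  # 1-based leftmost mapped position
    mapq: int      # mapping quality
    cigar: str     # how the read aligns (matches, insertions, ...)

def parse_sam_line(line):
    """Parse the first six mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return Alignment(
        read_name=fields[0],
        flag=int(fields[1]),
        chrom=fields[2],
        position=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
    )

def is_mapped(alignment):
    """Flag bit 0x4 is set when the read is unmapped."""
    return not (alignment.flag & 0x4)

def is_reverse_strand(alignment):
    """Flag bit 0x10 is set when the read maps to the reverse strand."""
    return bool(alignment.flag & 0x10)

# A made-up alignment line with the 11 mandatory tab-separated fields.
sam_line = "read_1\t16\tchr1\t10468\t60\t7M\t*\t0\t0\tGATTACA\tIIIIHH#"

alignment = parse_sam_line(sam_line)
print(alignment.chrom, alignment.position, is_mapped(alignment), is_reverse_strand(alignment))
```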

Peaks and Valleys

Next, we need to define peaks, and for this we use one of two software packages: MACS/MACS2 (Zhang et al., 2008) or SICER (Zang et al., 2009). These programs use the BAM files created by BWA to determine if and where the reads are enriched within each sample, across the genome. These regions of signal enrichment are referred to as “peaks” and serve as the functional unit of much of the analysis. It is important to note that we normalize the data so that peak “calling” and subsequent observations are not driven by technical variation but instead reflect the underlying biology. This process generates a BED file that contains, for every peak called, the chromosome, start position (in bp), end position, and a series of metadata associated with that peak.
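
The idea of enrichment can be sketched with a deliberately simplified toy example in Python. It counts reads in fixed-width bins, scales a control sample to the sequencing depth of the ChIP sample, and reports bins whose fold enrichment passes a threshold as BED-style intervals. This is not the statistical model used by MACS2 or SICER. The bin size, threshold, pseudocount of 1, read positions, and function names are all assumptions chosen for illustration.

```python
from collections import Counter

def bin_counts(positions, bin_size):
    """Count 0-based read positions per fixed-width bin."""
    return Counter(position // bin_size for position in positions)

def call_peaks(chrom, chip_positions, control_positions, bin_size=100, min_fold=2.0):
    """Return BED-like (chrom, start, end, fold) tuples for enriched bins."""
    chip_bins = bin_counts(chip_positions, bin_size)
    control_bins = bin_counts(control_positions, bin_size)
    # Simple depth normalization: scale control counts to the ChIP read depth.
    scale = len(chip_positions) / len(control_positions) if control_positions else 1.0
    peaks = []
    for bin_index in sorted(chip_bins):
        # Pseudocount of 1 avoids dividing by zero in bins with no control reads.
        expected = control_bins.get(bin_index, 0) * scale + 1.0
        fold = chip_bins[bin_index] / expected
        if fold >= min_fold:
            start = bin_index * bin_size  # BED intervals are 0-based, half-open
            peaks.append((chrom, start, start + bin_size, round(fold, 2)))
    return peaks

# Toy read positions on one chromosome (invented data).
chip = [105, 110, 120, 130, 150, 160, 170, 180, 520, 910]
control = [50, 130, 530, 700, 920]

peaks = call_peaks("chr1", chip, control)
for chrom, start, end, fold in peaks:
    print(f"{chrom}\t{start}\t{end}\t{fold}")
```

With this toy data, only the bin covering positions 100 to 200 is reported, because eight of the ten ChIP reads pile up there while the depth-scaled control signal stays low.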

We additionally generate a bigWig (.bw format) file that contains this same peak information and, at 100-200 megabytes in size, is a lot more portable than the original raw FASTQ file, which can be 1-2 gigabytes. These bigWig files can be uploaded to a web-based genome browser like the UCSC Genome Browser or opened in a desktop genome browser program like IGV (Integrative Genomics Viewer; Robinson et al., 2011). IGV allows researchers to simply drag and drop their data from the desktop and visualize regions of interest. Alternatively, they may link bigWig files, hosted on an FTP server, to the UCSC Genome Browser. Researchers can use the genome browsers to search for genes/loci, compare tracks, and make screenshots for publication.

After peak calling is complete, it’s time to perform downstream analysis. Because we’re working with an annotated genome, we can assign peaks to specific features such as their nearest gene or that gene’s promoter region. Often, our clients are interested in a differential analysis that compares one group of samples to another to identify regions where the signal was significantly different. Using the R package DESeq2, we can quantify differences between samples at specific peaks (Love et al., 2014). Because these peak regions have been annotated to nearby genomic features, it is at this point that researchers can ask questions such as which genes are differentially regulated as a function of condition.
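
The annotation step can be pictured with a small Python sketch that assigns each peak to the gene whose transcription start site (TSS) lies closest to the peak midpoint. The gene names, coordinates, and the `nearest_gene` helper are hypothetical. Real annotation tools also account for strand, promoter windows, and overlapping features, and the differential analysis with DESeq2 is not reproduced here.

```python
def nearest_gene(peak, genes):
    """Return (gene_name, distance) for the TSS closest to the peak midpoint.

    peak  : (chrom, start, end)
    genes : list of (gene_name, chrom, tss_position)
    Returns None when no gene is annotated on the peak's chromosome.
    """
    chrom, start, end = peak
    midpoint = (start + end) // 2
    candidates = [
        (abs(tss - midpoint), name)
        for name, gene_chrom, tss in genes
        if gene_chrom == chrom
    ]
    if not candidates:
        return None
    distance, name = min(candidates)
    return name, distance

# Hypothetical gene annotation: (name, chromosome, TSS position).
genes = [
    ("GENE_A", "chr1", 1000),
    ("GENE_B", "chr1", 5000),
    ("GENE_C", "chr2", 1200),
]

# Hypothetical peaks, as they might appear in the first three BED columns.
peaks = [("chr1", 1100, 1300), ("chr1", 4000, 4400), ("chr3", 0, 100)]

annotations = [nearest_gene(peak, genes) for peak in peaks]
for peak, annotation in zip(peaks, annotations):
    print(peak, annotation)
```

Once every peak carries a gene label like this, per-peak differences between conditions can be read as candidate differences in the regulation of the associated genes.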

The Final Analysis

For a final bioinformatic analysis provided by Active Motif, researchers receive the FASTQ (raw unaligned reads), BAM (aligned reads), and bigWig (peak data) files for all of their analyzed samples. Beyond that, we deliver a suite of graphics, annotation files, and genome browser screenshots that, when taken together, offer deep insight into the epigenetic question being asked.

To learn more about our end-to-end Epigenetic Services, contact us!

About the author

Nick Pervolarakis, Ph.D.

Nick grew up in a town called Lake Orion, Michigan and as the name implies spent a lot of time on the water. He graduated from the University of Michigan, Ann Arbor with a B.S. in Microbiology where he studied lung microbial communities through metagenomics. After receiving his degree, he began graduate school at the University of California, Irvine in the Mathematics, Computational, and Systems Biology program. His work there centered on applying single cell technologies to explore the mammary gland in a healthy and cancer context. At Active Motif, Nick works as a Computational Biologist and enjoys connecting with customers and their data to understand underlying biology through epigenetics. Beyond work, Nick enjoys reading, watching international films, and eating as many different varieties of food as he can get his hands on.

