Genome Sequencing: Reading the code of life

In the earlier approaches to recombinant DNA (rDNA) technology, the first step was to identify the gene of interest.
But this was very tedious because we had to cut the entire genome into smaller pieces (like a shredder), express them in plasmids and then find the gene of interest like a needle in a haystack.
Genome sequencing changed this. We can now identify the exact location of a gene down to the base pair.
Genome sequencing in simple words is reading the entire book of genome letter by letter (base by base).
In other words, it is a procedure for determining the linear order of nucleotide bases in DNA.
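In 1977 the first complete genome, that of a bacteriophage (a virus that infects bacteria), was sequenced. The first bacterial genome followed in 1995, and the first draft of the human genome was sequenced between 1990 and 2001.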

Advancements in genome sequencing

Short-read: The earlier approach to sequencing was the short-read method, in which the genome was chopped into small fragments that were sequenced and then reassembled like a jigsaw puzzle.
This approach is also called shotgun sequencing.
In this approach we read about 150 bases at a time.
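
The jigsaw-style reassembly can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how production assemblers work: the function names, the three example reads and the minimum overlap of 3 bases are all made up for the example, and the reads are assumed to be error-free.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the pieces cannot be joined
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


# Hypothetical short reads taken from one longer sequence
fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(fragments)
print(contigs)
```

If two different parts of the genome share the same stretch of bases, more than one merge looks equally good, which is one reason repeated sequences are hard to place with short reads.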

Reading short segments of the genome was time-consuming and labour-intensive.
Besides, the sequences of many fragments looked the same, making it difficult to tell them apart.
Long-read: With the use of nanopore technology (a form of nanotechnology), we can read much longer stretches of the genome. Reads as long as 2.3 million bases have been obtained in one go.
That is more than 10,000 times the length of a short read. We hope to read an entire chromosome in a single read in times to come.

Advantages

Faster and cheaper
Suitable for identifying pathogens in food, in disease control and diagnosis of infection.

Next-Generation Sequencing

Computers are used to read many fragments at the same time in an automated process.
This makes whole-genome sequencing faster, more accurate and cheaper.

Human Genome Project

It took more than 10 long years to read the sequence of the human genome.
The HGP started in 1990 and the first draft was published in 2001.

Significance of HGP

Until 2001 our understanding was that one gene coded for one protein responsible for one trait.
With the HGP this was revised. We now know there are about 20,500 genes coding for 80,000-100,000 proteins.
Also, more than one gene can be responsible for one trait.
Only about 1.5% of the genome codes for proteins. The remaining 98.5% or so is the non-coding part.

Findings of HGP

Junk DNA

Initially we thought it had no role to play and hence it was called ‘Junk DNA’ or ‘dark area of the genome’.
Now we know the so-called dark genome has a significant role in regulating how genes function which is very important in gene expression.
As we have seen, DNA → RNA → Protein is not a simple two-step process but a dynamic one.
Also, it is the gene expression that is more important than the gene itself.

Constituents of dark genome

The non-coding part has a significant role in every minute step of transcription and translation; in effect, it is the dynamic part of gene expression.
Recently, parts of the dark genome have been linked to bipolar disorder and schizophrenia.
We still do not have a complete understanding of our genome and the search is on, but the following are a few important findings so far:

Non-coding RNA (ncRNA)

Sequences that give rise to non-coding RNAs are a distinct feature found all over our genome.
They play important roles in gene expression and regulation.

Application

Scientists are exploring the use of ncRNA as potential biomarkers for cancer diagnosis and treatment.
ncRNA may also be used to develop new therapies for diseases such as viral infections and autoimmune disorders.

Long non-coding RNA (lncRNA)

RNA with more than 200 bases.
Important in regulation of splicing of introns and exons during transcription

Application

Scientists are investigating their role in the development of cancer and other diseases.
lncRNA may be used as a target for new therapies aimed at altering gene expression.

Small non-coding RNA

Less than 200 bases.
Involved in post-transcription regulation.
2 types of sncRNA are discussed below:

Short-interfering RNA

It can knock down selected genes; in effect, it can silence a gene.
Extremely important in RNA interference technology, seen as the future of disease management.

Micro-RNA

Around 1,000 identified in mammalian cells, each about 21 bases long.
Post-transcriptional regulation.
A single miRNA can regulate more than one messenger RNA (mRNA).
It essentially leads to the cleavage or degradation of the mRNA so that it cannot be translated.

Transposons or Jumping genes

Nearly half of our genome is made of sequences that can hop from one place to another in the genome.
This virus-like genetic material that can move from one location to another in our genome is called a transposon or jumping gene (although, strictly, it is incorrect to call them genes).
They are very important to understand mammalian evolution.
They also have potential applications in gene therapy, where they can be used to deliver genes to specific cells.

RNA interference technology/ Gene Silencing

RNA interference technology is the use of siRNAs to silence the expression of a gene.
Here siRNA is introduced into the cell, where it binds to the matching mRNA and triggers its destruction before it can be translated at the ribosome.
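
The binding works by base pairing: the siRNA strand that guides the attack is complementary to a stretch of the target mRNA. A minimal Python sketch of that pairing follows. The function names, the made-up mRNA sequence and the 21-base guide length are assumptions for illustration only; real siRNA design involves many more considerations.

```python
# Base-pairing rules for RNA: A pairs with U, G pairs with C
COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement(rna):
    """Return the strand that would pair with `rna`, read 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def design_guide(mrna, start, length=21):
    """Guide strand complementary to mrna[start:start + length]."""
    target = mrna[start:start + length]
    if len(target) < length:
        raise ValueError("target window runs past the end of the mRNA")
    return reverse_complement(target)


def find_target_sites(mrna, guide):
    """All positions in `mrna` that the guide pairs with perfectly."""
    site = reverse_complement(guide)
    positions = []
    pos = mrna.find(site)
    while pos != -1:
        positions.append(pos)
        pos = mrna.find(site, pos + 1)
    return positions


# Hypothetical mRNA, not a real gene
mrna = "AUGGCUACGUUAGCCGAUAGCUAGGCUAACGUAGCUAGCUAGGAUCC"
guide = design_guide(mrna, start=5)
sites = find_target_sites(mrna, guide)
print(guide, sites)
```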

Application

Cancer treatments
Pest resistance
Can be used to manufacture pesticide.

Human Genome Project: Write: Recoding the Code of Life

An international project launched in 2016 to synthesise a human genome from scratch. While the original HGP aimed at “reading” the book of genome, HGP-Write aims at writing the genome.
HGP-Write essentially helps in ‘recoding’ our genome for many applications including to alter our susceptibility to disease, our ability to respond to drugs etc.

Illustration: HGP-Write is trying to ‘recode’ the human genome so that our cells become immune to any virus

The HGP of 2001 revealed there are about 20,500 genes coding for 80,000-100,000 proteins.
Besides, a set of three bases (a codon) codes for one amino acid.
All the proteins we are familiar with are made of 20 amino acids. But the total number of possible codons is 64 (4 × 4 × 4).
Three of these are stop codons and the remaining 61 code for the 20 amino acids, so many are redundant, meaning they code for the same amino acid. For instance, GGT, GGC, GGA and GGG all code for the same amino acid, glycine.
We know viruses depend on the host cell machinery, such as tRNA, to make their proteins.
If you removed all redundant codon and the tRNA machinery that decodes it, viruses would not be able to translate their genes into proteins.
Thus, our recoded cells would be immune from viral attack.
This would require at least 400,000 changes to the human genome.
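
The codon arithmetic above can be checked with a short Python sketch. It builds the standard genetic code from the usual compact one-letter amino acid string (bases ordered T, C, A, G; `*` marks a stop codon) and groups codons by the amino acid they encode. The variable names are chosen for this example only.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" = stop
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 4 x 4 x 4 codons to its amino acid
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons under the amino acid they code for
synonyms = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonyms[aa].append(codon)

stop_codons = synonyms.pop("*")

print(len(CODON_TABLE), "codons in total")
print(len(synonyms), "amino acids,", len(stop_codons), "stop codons")
print("Glycine (G):", sorted(synonyms["G"]))
```

Any amino acid with more than one entry in `synonyms` has redundant codons, and these are the ones a recoding effort could in principle consolidate.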

Important Genome Sequencing efforts and their significance

Name	Significance
ENCODE (Encyclopedia of DNA Elements) and FANTOM (Functional Annotation of the Mammalian Genome)	Genomics projects that have given significant details of the dark area of the genome.
Genome India Project	Aim is to create a reference genome of Indians. The problem with the HGP is that it was sourced largely from the white population, so diversity is not accounted for. Launched in 2019 to do whole-genome sequencing of over a million Indians from diverse ethnic groups. Application: to understand the Indian population’s susceptibility to disease and to develop precision medicine and personalized healthcare.
INDIGEN	A research project to understand the genetic diversity of the indigenous population of India. Main aim is to gain insights into the genetic ancestry and diversity of the indigenous population.
Genomics in Indian agriculture	Wheat Genome Sequencing Programme; Rice Functional Genomics; National Plant Gene Repository programme; Next Generation Challenge Programme on Chickpea Genomics; INDIGAU: genomics of the cow
National Genomic Grid	Aims to collect cancer cells and tissues to facilitate cancer research in India. Uses genome sequencing of these high-quality cancer samples to study genomic factors influencing cancer in the Indian population.
Earth Bio-genome Project	International effort to sequence and digitize the genomes of all eukaryotic species on Earth over a period of 10 years. It is an open-source DNA database. Application: planning environmental conservation initiatives. Issue: may lead to digital bio-piracy (because it is open-source), which is against the principle of the Nagoya Protocol to the Convention on Biological Diversity, which requires sharing of benefits with local communities.
Human Microbiome Project	Effort to study the genes of microbes in the human body, including the gut, skin, oral cavity and vagina, to study their role in human health and disease. Note: the human body was long said to contain 10 times as many microbes as human cells; more recent estimates put the ratio closer to 1:1.
DeepVariant: AI in genomics	Google’s AI system that converts data from high-throughput sequencing into an accurate picture of the entire genome. It does this by automatically identifying insertions, deletions and other small differences in the sequence.
AlphaFold	Google DeepMind’s AI system that predicts the 3D structure of proteins. Important for understanding diseases and for the corresponding drug development.

Variations in the genome: Identity markers

Humans are 99.9% identical in genome to one another.
But given the size of the human genome (3 billion base pairs), even a small (0.1%) proportion of variation is huge: about 3 million base pairs.
These variations are very important in deciding the complete make-up of an individual, from one’s eye colour to one’s susceptibility to disease, one’s parentage and even one’s ancestry.
The variation in the base is called polymorphism.
Further, variations can be of many types: single-base variations, large sequence variations, variations in the way sequences are structured etc.
The study of variations has interesting applications. (see table)

DNA Fingerprinting / DNA profiling

There are some sequences in our genome (15 to 100 bases) that keep repeating over and over again.
Say, like the word ‘green’ repeating in a book. It so happens that the number of times this repetitive sequence occurs on a chromosome differs from person to person.
These are called VNTRs (variable number of tandem repeats). By counting the number of times the repetitive sequence occurs on the chromosomes (both the mother’s and the father’s versions), we can establish the identity of an individual. This is the DNA fingerprint of that individual.
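
The counting step can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the sequences are invented, and a 4-base motif is used instead of a realistic 15 to 100 base repeat so that the example stays readable.

```python
import re


def longest_tandem_run(sequence, motif):
    """Largest number of back-to-back copies of `motif` in `sequence`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


def vntr_profile(maternal, paternal, motif):
    """Repeat counts on the two copies of a chromosome, smaller first."""
    counts = (longest_tandem_run(maternal, motif),
              longest_tandem_run(paternal, motif))
    return tuple(sorted(counts))


# Invented example: the same locus on the mother's and father's chromosome
motif = "GATA"
maternal_copy = "TTAC" + motif * 5 + "CCGT"
paternal_copy = "TTAC" + motif * 8 + "CCGT"

profile = vntr_profile(maternal_copy, paternal_copy, motif)
print(profile)
```

A real DNA profile combines the counts from several such loci, since two unrelated people are far less likely to match at all of them than at any single one.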

Application

Forensics
Establishing parentage

Single Nucleotide Polymorphism and Population genetics (related Nobel Prize 2022)

Single letter changes in DNA are called SNPs.

They occur throughout a person’s DNA, one in every 300 letters on an average.
It could occur due to mutation during DNA replication (cell division) or may be inherited.
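
As a rough sketch of what spotting SNPs means in practice, the Python below compares two already-aligned sequences of equal length and lists the single-letter differences. It also works out how many SNPs the one-in-300 figure implies for a genome of 3 billion letters. The function names and the two short sequences are assumptions made for the example; real SNP calling works on aligned sequencing reads and has to deal with errors, insertions and deletions.

```python
def find_snps(reference, sample):
    """List (position, reference_base, sample_base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def expected_snp_count(genome_size, spacing=300):
    """Number of SNPs if one occurs every `spacing` letters on average."""
    return genome_size // spacing


# Invented sequences, positions counted from 0
reference = "ATGCCGTAAGCT"
sample = "ATGCTGTAAGCA"

snps = find_snps(reference, sample)
genome_wide = expected_snp_count(3_000_000_000)
print(snps)
print(genome_wide)
```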

Application

They can act as biological markers to locate a gene associated with disease.
Could help in dealing with future pandemics.
Some SNPs which are inherited act as markers to indicate ancestry. (very important in study of population genetics)
Essentially, the higher the frequency of an SNP in a population, the farther back it lies in the ancestry.
SNPs keep changing as they flow from one generation to another.
Tracing the gene flow by observing SNPs is one way of establishing ancestral links.

Nobel Prize for 2022 (paleogenetics)

Svante Pääbo, the scientist who won the Nobel Prize in 2022, studied the gene flow from extinct hominins to Homo sapiens after the migration out of Africa about 70,000 years ago.
He studied the gene flow from Neanderthals and Denisovans (both extinct hominin species) to Homo sapiens.
This gene flow is important for understanding our immune reactions to infections.

1000 Genomes Project

It is an effort to study different variations in DNA, including SNPs and larger organizational variations in the genome.
