Photo by Bill Majoros.  Used with permission.



NEWS

  • ACE+, a probabilistic version of ACE, has been released - A new version of ACE has been released that includes a probabilistic model called SGRF for ranking splicing predictions.  Go to the page for SGRF/ACE+ for more information.
  • Fixed issue with temp files - If you downloaded ACE prior to April 2017 and you run multiple instances of ACE in the same directory, you may encounter filename clashes with temp files.  Please download the newest version of the perl submodule: cd perl ; git pull
  • Python parser for ACE output - the Essex parser (currently available for both Perl and C++) for mining ACE output has been ported to Python and is now in beta testing - you can download the beta version here: https://github.com/bmajoros/python
  • Docker image - A pre-compiled Docker image is now available for easier installation and running of ACE on any platform that can run Docker.  First, install Docker and start the Docker app, then download the image using the command "docker pull genezilla/ace".  Then you can run any command in the ACE package by prefixing the command with "docker run -it genezilla/ace".
  • Example inputs/outputs - Example inputs and outputs are now available, including configuration files, reference sequences and annotations, and complete ACE annotations for a single individual.


Contents
 
Description
Example Outputs
Download
Installation
Running
Output Format
Parsing Outputs
Utility Scripts
Algorithms
Contact / Credits / Funding



Description


ACE (Assessing Changes to Exons) is a software suite designed to detect changes in gene structures in the genomes of individuals or strains, and to interpret those changes as to possible loss of function.  It has been applied to thousands of human genomes and to numerous plant samples. 
It was developed in the Reddy lab at Duke University in collaboration with the Yandell lab at the University of Utah and Michael Campbell at Cold Spring Harbor Laboratory.
    



     ACE begins by constructing an explicit genome sequence from an indexed VCF file and a reference genome.  ACE utilizes the VCF file to rapidly infer an alignment of the reference sequence to the individual genome and uses the alignment to compute a coordinate transformation to be used to map annotations from the reference to the individual genome. 
     ACE analyzes all projected splice sites for changes to either the consensus sequence or to flanking bases that influence splice site strength.  ACE reports all splice sites that are disrupted or substantially weakened in the individual genome, and enumerates possible alternate splice patterns that may result from loss of the reference splice site, including those arising through exon skipping, intron retention, or cryptic-site activation (exon skipping and intron retention can be individually enabled or disabled via a configuration file). Noncoding genes are included in this analysis.  For protein-coding genes, ACE additionally analyzes the reading frame in each isoform and reports any change that may result in protein truncation or mRNA decay.  Unlike comparative gene finders that model homologous genes in different species and assume conserved reading frames, ACE is designed to compare genomes of the same species, and is specifically designed to identify genes that may have become inactivated or changed in function via recent sequence modifications.

     When processing VCF files, ACE can apply single-nucleotide variants, insertions, deletions, and short copy-number variants.  ACE left-normalizes variants and resolves overlapping variants in the same haplotype that are compatible.  ACE provides multiple filtering levels for sequences containing questionable variants (e.g., any variant overlapping an inconsistent variant phased to the same haplotype).  Because ACE uses indexed VCF files and indexed genome files, it can extract individual genes efficiently, enabling gene-level parallelization for processing large samples.  ACE applies all variants within a gene that are present on a haplotype simultaneously, so that the predicted changes to splicing patterns and/or reading frames reflect the combined effect of any and all variants that may interact (for example, when one variant alters a reading frame and one or more downstream variants correct it).  ACE can detect the loss of start codons resulting from either splicing changes or directly from sequence variants, and applies a scanning model to identify likely alternate start codons; a retrainable scoring matrix can be used to ensure that candidate start codons have appropriate flanking bases (e.g., a strong Kozak sequence).  ACE can detect both premature stop codons and variants that result in an absence of a stop codon anywhere in the reading frame, either of which may lead to mRNA decay and partial or complete loss of function.  ACE is highly configurable and includes options for use on plant genomes, such as prediction of intron retention.  ACE produces structured output in multiple formats, including XML, and provides parsers and command-line utilities for convenient extraction of specific information and summary statistics.


Example Inputs and Outputs

Example input and output files, including sequences and annotations, can be downloaded here:

http://www.geneprediction.org/ACE/examples.html


Download

ACE is open-source software, and can be downloaded from GITHUB:

https://github.com/bmajoros/ACE

Note that in order to compile ACE you will first need to install the Gnu Scientific Library (GSL).  (If you install via Docker, the GSL will be automatically installed for you.)
     ACE is designed to run on Linux, but it can run on any platform via Docker.  It is written in C++ and Perl.


Installation

Prerequisites for Installing ACE

The following are required to install ACE directly on your system (however, installation via our pre-compiled Docker image is simpler--see below):
  • A Linux system is required, with a gcc/g++ compiler and linker, and Perl.  It may also run on other UNIX systems, with modification.
  • Installation must be on a case-sensitive filesystem
  • The Gnu Scientific Library (GSL) must be installed.
  • The program twoBitToFa from the Kent Utils package must be installed and on your executable path.
  • The program tabix from the samtools package must be installed and on your executable path.

Installing and Compiling Source Code

You can either install ACE directly on your hardware, or install it via a Docker image.  Installing via Docker is described below.  To download and install source code from GITHUB, copy each of these commands and paste them onto your Linux command line:

git clone --recursive https://github.com/bmajoros/ACE.git
cd ACE
make all

If you encounter any compilation errors, there may be a mismatch between compiler or library versions.  In that case we recommend you install via Docker.

IMPORTANT UPDATE: Please add the full path to the ACE/perl directory created above to your $PATH environment variable, so that ACE scripts can call programs in the ACE/perl directory.

Next, please set these environment variables:
  • $TMPDIR : absolute path to a location where temporary files can be made (this might already be set by your system administrator)
  • $ACE : absolute path to the ACE directory created above
  • $PERLLIB : absolute path to the ACE/perl directory created by the git clone command
  • $PYTHONPATH : absolute path to the ACE/python directory created by the git clone command
  • $GSLDIR : absolute path to the Gnu Scientific Library
If you use the tcsh or csh shells, this can be done using the setenv command; if your shell is bash, you'll need to use the set and export commands.  It is recommended to put these commands in your .cshrc or .bashrc file so they run every time you log in.  Note that your system administrators may have already set $TMPDIR for you.
      You may want to add the ACE directory to your executable path, so that commands can be run without specifying their location.

Installing ACE via a Docker Image

To install ACE via a Docker image, you first need to install Docker on your system.  Once Docker is installed and the app is running, you can either download the 1 GB pre-compiled image file, or build the image yourself from a Dockerfile.  To download the pre-compiled image:

docker pull genezilla/ace

If you instead prefer to build a docker image yourself, download the Dockerfile for ACE and build it using the docker build command:

wget https://raw.githubusercontent.com\
/bmajoros/ACE/master/Dockerfile


mkdir ace

mv Dockerfile ace

docker build -t genezilla/ace ace


Once you have a compiled image (either downloaded via docker pull or built manually using docker build), you can execute any ACE command by first prefixing it with: docker run -w /root -v X:/root -it genezilla/ace, where X is the absolute path to the current directory where you're working.  For example, to build personal genomes from a VCF file, the command would be:

docker run -w /root -v DIR:/root -it genezilla/ace make-individual-genomes.pl

where DIR is the absolute path to the current directory where you're working.

(Thanks to Alex Barrera for help setting up the Docker image for ACE)



Running

Prerequisites for Running ACE

The following are required to run ACE:
  • A set of reference annotations in GTF format.  Note that GFF3 is not yet supported (but will be in the next release).  Every exon must have a transcript_id and a gene_id. Coding segments must be labeled as CDS, and untranslated regions/exons (in coding or noncoding genes) should simply be labeled exon.  An incomplete list of some sources of annotations are provided below.
  • A reference genome sequence in 2bit format.  You can convert a FASTA file to 2bit using the faToTwoBit utility from KentUtils.
  • One or more phased, indexed, bgzipped VCF filesYou can bgzip your VCF files using the bgzip utility from htslib; gzip is not sufficient. You can index your bgzipped VCF files using tabix.  Your VCF must be sorted before tabix will index it; you can use the sort-vcf.pl script in ACE to sort it if necessary.

Creating a Configuration File

Before running any of the programs in ACE, you must create a configuration file.  A sample configuration file is included in the ACE directory, called sample.config, which you can customize:

donor-consensus         = GT,AT,GC
acceptor-consensus      = AG,AC
start-codons            = ATG
stop-codons             = TGA,TAG,TAA
max-splice-shift        = 70
min-exon-length         = 30
min-intron-length       = 30
NMD-distance            = 50
allow-exon-skipping     = true
allow-cryptic-sites     = true
allow-intron-retention  = false
start-codon-model       = start-codons-12bp.model
short-start-codon-model = start-codons-3bp.model
stop-codon-model        = stop-codons-3bp.model
donor-model             = donors57-100.model
acceptor-model          = acceptors57-100.model
gap-open-penalty        = 5
gap-extend-penalty      = 10
subst-matrix            = pam10
bandwidth               = 50
ploidy                  = 2
margin-around-gene      = 1000
min-orf-length          = 150
twoBitToFa    = /usr/local/bin/twoBitToFa
tabix         = /usr/local/bin/tabix
genome        = /data/reddylab/Reference_Data/hg19.2bit
chr-lengths   = /data/reddylab/Reference_Data/hg19.genome
gender        = /home/bmajoros/1000G/vcf/gender.txt
individuals   = /home/bmajoros/1000G/assembly/fasta/Geuvadis.txt
vcf           = /home/bmajoros/1000G/vcf
vcf-lacks-chr = true

These should be largely self-explanatory, except for the following:
  • max-splice-shift is the maximum distance from a disrupted splice site that ACE will search for a cryptic site to use in place of the disrupted site; for humans, a distance of 70 to 150 is recommended (i.e., 70 to favor specificity, 150 to favor sensitivity).
  • NMD-distance refers to the nucleotide distance (on the spliced transcript) between a stop codon and the most 3' exon junction; stop codons further upstream than this distance may trigger NMD.  This option allows the default of 50 bp to be overridden.
  • Default *.model files are included in the ACE directory for use on human genomes; for other organisms these must be retrained, as described later.  The short-start-codon-model captures only the three positions containing the start codon (usually ATG), for use in contexts where flanking positions may fall outside the annotated transcript.
  • gap-open-penalty, gap-extend-penalty, and bandwidth are parameters for aligning proteins (to detect large changes to protein sequence).
  • min-orf-length is used to detect protein-coding genes that are possibly misannotated as noncoding genes; the value is given in nucleotides.  150 provides reasonable discrimination between true and false positives in human genomes.  In non-human species, a reasonable cutoff can be determined by plotting the histogram of known ORF lengths versus ORFs randomly sampled from intergenic regions, and choosing a value that best separates the two histograms.  A set of scripts are provided to help with this.
  • pam10 is an amino-acid substitution matrix included in the ACE directory.  Please include the full path to this file.
  • twoBitToFa and tabix are the full paths to where these programs are installed on your system.
  • genome is the full path to the 2bit genome file on your system.
  • vcf is the path to the directory where bgzipped VCF files are located (along with *.tbi index files).
  • vcf-lacks-chr=true indicates that chromosome names within the VCF file are not prefixed with the letters chr as expected.  Please ensure that chromosome names are prefixed with chr in all other input files and in filenames where appropriate.
  • ploidy values of 1 or 2 are currently supported, and values >2 are currently in beta testing for the next release.
  • individuals is the full path to a file listing the identifiers of the individuals for which ACE is to perform analyses.  Note that all individuals in the VCF file must also be listed in this file.  In order to run ACE on a subset of individuals, your VCF file must include only columns for those individuals.
  • gender is required only for organisms having sex chromosomes: if present, it lists one individual per line:
HG00096 male
HG00097 female
HG00099 female
HG00100 female
HG00101 male
...     ...

  • chr-lengths is the path to a file giving the length of each chromosome, as illustrated below:
chr1    249250621
chr2    243199373
chr3    198022430
...     ...

You can create multiple configuration files.  This is particularly useful in order to use different models trained for different isochores (regions of uniform G+C content), as described below.

Making Individualized Genomes

Individualized genomes can be constructed using the following wrapper script in the ACE directory.  This wrapper script calls several C++ binaries in the ACE directory, so you must have the environment variable $ACE set to the proper path in order for this script to work:

make-individual-genomes.pl <ace.config> <genes.gtf> <out-dir>

Example input and output files can be found here.  The <ace.config> parameter is the path to the configuration file described above.  The GFF2 file <genes.gtf> specifies the coordinates of the genes to be included in the constructed genome.  Recall that in the configuration file the individuals element points to a file listing the identifiers of the individuals for which genomes are to be built.
    Please note that chromosome names must begin with the letters chr, and this must be consistent between the GTF/GFF2 files, the VCF files, and the chromosome lengths file.  Also, individual VCF files must contain the chromosome name in the filename, and they must be bgzipped (not gzipped---bgzip is distributed as part of the HTSlib project which is part of samtools).
    Also note that the GFF files must be in GFF2 format (not GFF3), which is also known as GTF format.  An example gene in GFF2 is shown below:

chr1    ensembl transcript      1002    17286   .       -       .      
   transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1    ensembl CDS    16644   17258   0       -       0      
   transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1    ensembl CDS   7631    8092    0       -       0      
   transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1    ensembl CDS   5869    5985    0       -       0      
   transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1    ensembl CDS   4778    4893    0       -       0      
   transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1    ensembl CDS   3443    3545    0       -       2      
   transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1    ensembl CDS   2294    2383    0       -       0      
   transcript_id=ENST00000378251;gene_id=ENSG00000130764;

chr1    ensembl CDS      1870    2118    0       -       0      
   transcript_id=ENST00000378251;gene_id=ENSG00000130764;

chr1    ensembl five_prime_UTR  17259   17286   0       -       0      
   transcript_id=ENST00000378251;gene_id=ENSG00000130764;

chr1    ensembl three_prime_UTR 1002    1869    0       -       1      
   transcript_id=ENST00000378251;gene_id=ENSG00000130764;


It is essential that each element have a transcript_id tag and a gene_id tag; these can either be formatted as above or as shown below:

transcript_id "ENST00000378251"; gene_id "ENSG00000130764"

Protein-coding genes should have CDS elements, whereas noncoding genes should simply have exon elements.  Remember that coordinates in GFF are 1-based and the end coordinate is not inclusive.
     It is recommended that you run normalize-gff.pl (in the $ACE directory) on your GTF file to ensure that coding and noncoding genes are properly annotated as such.
     The VCF file must also be well-formed.  An example CHROM line and variant line from a 1000 Genomes Project VCF file are shown below:

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  HG00099
chr1  10177  rs367896724   A   AC  100  PASS  VT=INDEL GT  0|1

The variant line must contain PASS in the 7th column and GT in the 9th column, and ./. genotypes ("uncalled variants") are not permitted.  Extra fields in the genotype column are also not permitted.  If your VCF contains a null call for every position in the genome, these should be removed prior to running ACE.  Note that the substrate (chr1 in this case) should have a prefix of chr and this should be consistent across all VCF and GFF files.
     If your organism has sex chromosomes, such as X and Y, genotypes for these should be formatted as in the
1000 Genomes Project VCF files.  In particular, genotypes should have the same ploidy across all individuals, even for individuals having fewer (or no) copies.  For humans, a second, dummy allele should be added for males to produce two X alleles (as in the 1000 Genomes Project VCF file for chromosome X); the second allele will be ignored when males are processed (gender is obtained from the gender file listed in the configuration file).
    The output will be written to files in the output directory specified by <out-dir>.  Each file will be a multi-FASTA file containing the gene sequences for each individual; a ref.fasta file will contain the reference sequences for comparison.  When ploidy is greater than 1, a FASTA file will be generated for each haplotype; for example, for a diploid individual with identifier HG00096, there will be two output files: HG00096-1.fasta and HG00096-2.fasta.  A file called local.gff is also generated, which provides gene structure annotations with exon coordinates converted to local offsets for the corresponding reference sequences.
     In the resulting FASTA files for the individualized sequences, ACE stores information as /key=value elements on the defline.  These include the  coordinates of the gene on the reference chromosome, a CIGAR string giving the alignment of the reference sequence to the individualized sequence, the variants that were applied, and the numbers of warnings and errors generated by application of those variants:

>ENSG00000272636.1_1 /coord=chr17:4809-32427:- /margin=1000 /cigar=2501M1I2300M1D3390M2D5052M /variants= rs2223138:chr17:156:156:A:G, rs2894723:chr17:3104:3104:C:T, rs2396789:chr17:3737:3737:T:C, rs2894722:chr17:3758:3758:A:G, rs8081895:chr17:4551:4551:A:G, rs36068254:chr17:5773:5773:A:G, rs34663111:chr17:6933:6933:A:G /warnings=0 /errors=0


Mapping Annotations

Mapping of annotations to the individualized genome is performed by the ace binary, but it is more common to use the perl wrapper, ace.pl, which takes as input a configuration file and the output files produced by make-individual-genomes.pl:

ace.pl <config-file> <reference.multi-fasta>
  <individual.multi-fasta> <reference.gtf>

  <max-vcf-errors>
<out.essex>

where:
  • <config-file> is either a configuration file like the one shown above, or a directory containing a set of configuration files, one per isochore (G+C content interval).  In the latter case, the configuration files must be named ace.xx-yy.config; for example, ace.43-51.config for sequences having an average G+C content between 43% and 51%.
  • <reference.multi-fasta> is the reference FASTA file generated by  make-individual-genomes.pl.
  • <individual.multi-fasta> is the FASTA file containing an individualized genome sequence for a sample.
  • <reference.gtf> provides reference annotations to be mapped; in most cases this will be the local.gff file generated by make-individual-genomes.pl.
  • <max-vcf-errors> is the maximum number of VCF errors allowed before triggering a too-many-vcf-errors flag in the output
  • <out.essex> is the output file, which is a structured file that can be queried using provided scripts, or converted to XML or GFF for import into other programs.
Example input and output files can be found hereAs mentioned above, the output is a richly structured and machine-readable file containing detailed information regarding gene structures in the reference and individualized sequences, their differences, and possible repercussions of those differences (such as loss or change of function).  The output format is described below, as is the provided Perl library for parsing that output, and command-line utility scripts for extracting particular types of information.  One script in particular, essex-to-xml.pl, converts the output to XML so that external XML-processing software can be used instead of the provided utilities, if desired.

Sources of Annotations

Below are some popular sources of genome annotations for various species.  Note that while ACE has not been tested on all of these, in principle ACE can accept annotations from any source, as long as they are formatted properly in GTF.

Source URL Genomes and Annotations Available
GENCODE https://www.gencodegenes.org/ Human and mouse
NCBI Genomes ftp://ftp.ncbi.nlm.nih.gov/genomes/ Multiple species
RefSeq ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ Multiple species (curated subset of NCBI Genomes)
ENSEMBL http://uswest.ensembl.org/info/data/ftp/index.html Multiple species
Joint Genome Institute http://genome.jgi.doe.gov/ Multiple species
PlantGDB http://www.plantgdb.org/ Multiple plant species
FlyBase http://flybase.org/ Drosophila species
WormBase http://www.wormbase.org/ Nematode species
Hymenoptera Genome Database http://hymenopteragenome.org/ Bee, wasp, and ant species


Parallelizing ACE on Your System


Because ACE processes each gene individually, parallelization on different cluster computing systems is simple, via batch processing of different gene sets.  Both make-individual-genomes.pl and ace.pl can be run on a GTF file containing only a subset of the genes to be analyzed.  By splitting the full gene set into N subsets (GTF files) and submitting N batch jobs to process those GTF files on N different compute nodes, processing can ideally approach an N-fold speedup.  Note, however, that processing long genome sequences is I/O intensive, and that an N-fold speedup may not be fully achieved if your I/O subsystem has limited bandwidth to compute nodes.

Training Signal Sensors


ACE requires very little training, and has very few parameters, as compared to traditional gene finders.  However, it is important that ACE's splice-site and start-codon models are properly trained for your organism.  The ACE package includes default models for human splice sites and start codons.  For non-human organisms, signal sensor models need to be trained from training sequences.  The get-examples.pl script in the $ACE directory can be used to extract a set of example splice sites, start codons, and stop codons to use for training.  The script requires a set of annotated genes (in GTF format) on a given substrate (FASTA file):

get-examples.pl <*.gtf> <*.fasta> <donor-consensuses> <acceptor-consensuses> <start-codon-consensuses> <stop-codon-consensuses>

The consensus parameters are comma-separated lists of acceptable consensus sequences for splice sites and start/stop codons.  Different organisms can have different consensus sequences for these signals.  For example, in human we might use these parameters:

$ACE/get-examples.pl genes.gff genome.fasta GT,GC,AT AG,AC ATG TAA,TGA,TAG

This script generates a file donors.fasta containing example donor splice sites, a file acceptors.fasta containing example acceptor splice sites, and files start-codons.fasta and stop-codons.fasta containing example start and stop codons.  These examples can be used to train a signal sensor for each type of signal (splice site, start/stop codon), using the train-signal.pl script:

train-signal.pl <model-type> <pos.fasta> <neg.fasta>
  <filestem> <% train> <signal-type> <consensus-offset>
  <consensus-length> <context-window-length> <min-sens>
  <order> <min-sample-size> <boosting-iterations>
  <boosting-percentile>

Most of these parameters can be ignored by setting them to default values.  For example, training a model for donor splice sites can be accomplished via:

train-signal.pl WMM donors.fasta non-donors.fasta donor 1 GT
  4 2 13 0.98 0 0 0 0


This will train a simple PWM model (no dependencies between columns) and place it into a file called donor.model.  It will use all of the examples in donors.fasta to train the model (rather than withholding some examples for independent testing of the trained model), because the fifth parameter (1) specifies that 100% of the examples should be used for training.  The consensus-offset, consensus-length, and context-window-length parameters are illustrated by the figure below:



Thus, for the above command the model will capture 4 (the 7th parameter) positions upstream of the GT consensus (the "consensus offset"), with a full window size of 13 (the 9th parameter), so that an additional 7 positions downstream of the consensus will also be captured (7 = 13 - 2 - 4, where 2 is the consensus length).  A PWM scoring threshold will be computed so as to ensure a sensitivity of 98% (the 10th parameter) on the training set.  All other parameters are set 0 (conditional probability order, minimum sample size, and boosting parameters).
    A similar command line for acceptor splice sites would be:

train-signal.pl WMM acceptors.fasta non-acceptors.fasta
    acceptor 1 AG 5 2 12 0.98 0 0 0 0


To model dependencies between columns of the PWM, you can use change WMM to WAM, set the order parameter to 1 or higher, and set a minimum sample size (for estimating conditional probabilities) to 30 or more.
    A default command line for start codons would be:

train-signal.pl WMM start-codons.fasta non-start-codons.fasta
  start-codon 1 ATG 6 3 11 0.98 0 0 0 0

and a default command for stop codons would be:

train-signal.pl WMM stop-codons0-100.fasta
  non-stop-codons0-100.fasta stop-codons0-100 1 TAG 0 3 3
  0.98 0 0 0 0


Choosing A Minimum ORF Length

The min-orf-length parameter is used to detect ORFs in genes annotated as noncoding, as a means of detecting possible mis-annotated genes.  For human genes, 150nt provides a good cutoff for separating true positives from false positives.  For nonhuman organisms, you can use the script random-orfs.pl in the ACE/perl directory to sample a set of random ORFs from a chromosome FASTA file and compile their lengths:

$ACE/perl/random-orfs.pl chr1.fasta TGA,TAA,TAG 1000 > random-orf-lengths.txt

where TGA,TAA,TAG is the list of functional stop codons for your organism.  Similarly, you can compile a list of real ORF lengths using the following commands:

$ACE/perl/get-transcripts.pl genes-on-chr1.gff chr1.fasta transcripts.fasta
$ACE/perl/fasta-seq-lengths.pl transcripts.fasta > orf-lengths.txt


where genes-on-chr1.gff is a GTF (GFF2) file containing gene annotations that include CDS elements for protein-coding genes.  You can now use a tool such as R to plot the distributions of orf-lengths.txt and random-orf-lengths.txt and then choose a length cutoff that separates these two distributions:

d1<-read.table("orf-lengths.txt")
d2<-read.table("random-orf-lengts.txt")
h1<-density(d1$V1)
h2<-density(d2$V1)
plot(h1)
lines(h2)



Algorithms

Reconstructing Individualized Genomes

ACE reconstructs individualized genomes from a VCF file.  For diploid or higher-ploidy organisms, the VCF file must be phased to the appropriate ploidy.  All VCF files must be zipped with bgzip and must be indexed with tabix.  Using only indexed VCF files allows the genome building software to keep memory requirements to a minimum by allowing it to extract only the variants needed, using the program tabix.  Given a GTF file containing coordinates of genes, the program builds a genome consisting of sequences for those genes.  The result is a multi-FASTA file in which each element is a gene sequence.
     ACE applies any variant that is a single-nucleotide variant, an insertion, a deletion, or a copy-number variant.  All variants are left-normalized and then disambiguated with other variants that may be overlapping (due to genotyping errors or merging of variant sets from multiple sources).


Validating Reference Annotations

For each gene annotation to be mapped from the reference sequence to the individualized sequence, ACE verifies that the annotation includes only consensus splice sites, a consensus start codon, and a consensus stop codon (consensus sequences for these elements can be specified in the configuration file, enabling use on non-human organisms).  For any transcript failing this validation step, or for any reference transcript predicted to incur nonsense-mediated decay (see below), the transcript is still mapped to the individualized sequence, but inferences about the status of the mapped transcript (such as broken splice sites or premature stop codons) are suppressed, and a warning is issued regarding the annotation anomaly.

Mapping Annotations

ACE extracts the CIGAR string from the defline of the FASTA file (see above) and instantiates an explicit alignment from the reference sequence to the individualized sequence.  This alignment allows ACE to map the coordinates of all reference annotations to the individualized coordinate system via a simple coordinate transformation.  Prior to mapping, ACE reverse-complements the sequences and annotations as necessary for reverse-strand features; thus, the following descriptions are all relative to the sense strand.  Reverse-strand features are reverse-complemented again before outputs are generated, so that outputs will be on the same strand as the reference annotations.  For genic features, ACE maps the TSS (transcription start site), TES (transcription end site), start codon, and all splice sites from the reference sequence to the individualized sequence using the coordinate transformation given by the alignment.  Both coding and noncoding exons are mapped (including UTR, or untranslated regions).
     To determine mapped exon boundaries, ACE first maps the initial (5') nucleotide in each splice site and computes new exon boundaries relative to these sites (e.g., two nucleotides 3' of an acceptor splice site, one nucleotide 5' of a donor splice site).  ACE maps the initial position of the start codon (in the case of protein-coding genes) as the beginning of the mapped coding segment; if no valid start codon is found at that position, ACE scans downstream with a positional weight matrix to find an alternate start codon (see below).  If any genetic variants were applied in the 5' UTR, ACE will also scan for upstream start codons that may have been created or strengthened as a result of those variants.  Scans for start codons occur on the spliced transcript, so that start codons interrupted by an intron will be found.
     ACE then begins at the mapped start codon and iterates over nucleotide triples (codons) in the spliced transcript until a valid stop codon is found (the set of valid stop codons are specified in the configuration file); ACE permits the stop codon to be interrupted by an intron.  Note that a stop codon may be found in sequence that was formerly annotated as UTR (untranslated region) in the reference.  Thus, ACE identifies the end of the coding segment de novo rather than simply mapping the reference coordinate, as the reference coordinate may be incorrect.

Detecting Disrupted Signals

ACE checks each mapped splice site against the legal consensus sequences in the configuration file.  If all sites are valid, ACE proceeds to analyze the coding segment (in the case of protein-coding genes), as described below.  If any splice sites are invalid (or significantly weakened as indicated by a positional weight matrix), ACE searches within a fixed distance (specified in the configuration file; the default is 70 nt) of the mapped position for a cryptic splice site that might be activated in place of the invalid site; only sites scoring above the threshold of a positional weight matrix (PWM) are considered.  If multiple splice sites are invalid, ACE considers all combinations of alternate splicing decisions for the invalid sites, to form a set of alternate splice patterns for the mapped transcript.  Alternate splice patterns are formed via combinations of cryptic splice sites (sites that score above the PWM threshold but are not annotated as splice sites), exon skipping, or intron retention; any of these can be enabled or disabled in the configuration file, as some of these are rare in some organisms than others. For each splice pattern, ACE analyzes the coding segment (for protein-coding genes) for possible frameshifts, premature stop codons, missing start codons, or missing stop codons, as described below.
     In the case of protein-coding genes, after mapping the position of the start codon, ACE determines whether an valid start codon is present at the target position; the expected start codon consensus is defined in a configuration file that can be modified for use on non-human organisms (multiple consensus sequences may be specified).  If a legal start codon consensus is not found at the target location, ACE scans downstream (3') through the spliced transcript for another start codon with a sufficiently strong match to the provided positional weight matrix (PWM) in the configuration file; ACE permits a start codon to be interrupted by an intron.  If a start codon is found, it is used to determine a starting position for the mapped coding segment (CDS).  If no start codon is found, ACE reports a loss of coding potential.
     ACE then iterates over nucleotide triples (codons) in the spliced transcript until a valid stop codon is found (the set of valid stop codons are specified in the configuration file); ACE permits the stop codon to be interrupted by an intron.  Note that a stop codon may be found in sequence that was formerly annotated as UTR (untranslated region) in the reference.  If no stop codon is found, ACE predicts that the transcript will suffer nonstop decay due to lack of a stop codon (however, nonstop decay is not predicted if the reference annotation lacked 3' UTR coordinates, as the unknown 3' UTR may contain a stop codon).
     Premature stop codons are detected as described below; ACE predicts protein truncation or nonsense-mediated decay depending on the location of the premature stop codon relative to the final exon junction in the spliced transcript.

Detecting Possible Loss of Function

ACE can detect multiple forms of possible loss of function in protein coding genes:
  • Excessive change to the amino acid sequence
  • Loss of coding potential
  • Substantial protein truncation
  • Non-stop decay
  • Nonsense-mediated decay
Large changes to amino acid sequences arising from frameshifts are identified by aligning the reference and mapped proteins using a banded affine dynamic-programming global alignment algorithm.  The parameters to the alignment algorithm (gap open penalty, gap extend penalty, bandwidth) are specified in the configuration file and can be changed from their defaults; a PAM10 amino-acid substitution matrix is used to compute the optimal alignment.  A percent-match score is computed as the number of exact matches divided by the length of the reference sequence.
     Loss of coding potential is detected via lack of a start codon in the individualized sequence.  As described above, when the mapped position of the reference start codon is no longer a start codon in the individualized sequence (according to the specified start codon consensus sequences in the configuration file), ACE scans downstream for an alternate start codon, using a PWM (positional weight matrix).  If no alternate is found, ACE reports a loss of coding potential.  The PWM can be trained for nonhuman species, as described above; default PWMs for human are provided.
     Non-stop decay is detected as a lack of a stop codon in the identified reading frame in the individualized sequence.  As described previously, ACE searches in-frame for a stop codon downstream of the mapped start codon.  ACE permits the stop codon to be in sequence that formerly was considered UTR (untranslated region).
     Nonsense-mediated decay is detected using the 50-nucleotide rule: after splicing out introns, if the first in-frame stop codon is more than 50 nt upstream of the last (most 3') exon junction in the spliced transcript, nonsense-mediated decay is predicted.  The default value of 50 can be changed in the configuration file.  Note that untranslated (UTR) exons are included in this computation.

Positional Weight Matrices

Positional weight matrices are used to score splices sites and start codons.  Scripts for training new matrices from training examples are included in the package.  A score threshold is computed by applying the trained model to a held-out test set and identifying the score that results in a desired sensitivity: TP/(TP+FN), where TP=true positives and FN=false negatives.
     ACE can utilize G+C content (isochore) in choosing which PWMs to use in evaluating splice site strength (and similarly for evaluating start codons).  Default models for human are provided; for human, the default G+C intervals are 0-43%, 43-51%, 51-57%, and 57-100%.

Output Format

The output of ACE is a structured file in Essex format, which utilizes constrained forms of S-expressions (as used in the programming language LISP).  A script (essex-to-xml.pl) is also provided to convert Essex to XML, for use with standard XML parsers.  The complex nature of the analysis performed by ACE mandates a richly structured output format in order to represent the full gamut of possible changes to gene structures.  Nevertheless, a script (essex-to-tsv.pl) is also provided to convert the output to a much simpler TSV (tab-separated values) format for use with spreadsheets and other tabular interfaces
     ACE processes each transcript separately, and produces a report element for each transcript.  Essex is hierarchically organized, so that under the report element are additional elements:

(report
   (substrate ENSG00000249915_1)
   (transcript-ID ENST00000505221)
   (gene-ID ENSG00000249915)
   ...

The substrate identifier is extracted from the identifier on the defline of the reference FASTA file.  When building the individualized sequence, ACE places information on the defline indicating the number of vcf-warnings (ambiguous variants that were resolved) and vcf-errors (ambiguous variants that were not resolved):

   (vcf-warnings 0)
   (vcf-errors 0)

The alignment element describes the alignment from the reference sequence to the individualized sequence, using a CIGAR string that includes I elements for insertions, D elements for deletions, and M elements for matches/mismatches:

   (alignment 17692M4I11041M2D2837M2D11837M)

The original annotation is specified with the reference-transcript element:

   (reference-transcript
      (ID ENST00000505221)
      (gene ENSG00000249915)
      (type protein-coding)
      ...

The elipses indicate that ACE also reports additional information such as the exon coordinates (which are zero-based, with the begin coordinate being inclusive and the end coordinate non-inclusive), the strand, UTR (untranslated) elements, and so on.  The coordinates of the reference transcript are given relative to the reference sequence. The mapped transcript is included via the mapped-transcript element:

   (mapped-transcript
      (ID ENST00000505221)
      (gene ENSG00000249915)
      (type protein-coding)
      (substrate ENSG00000249915_1)
      ...

The coordinates of the mapped transcript are given relative to the individualized sequence.
     The status element indicates the result of mapping:

   (status mapped)

A simple "mapped" indicates that the mapping was perfect (all splice sites remain valid, the start codon remains valid, and the stop codon remains valid).  When running ACE with the -q (quiet) option, transcripts that map perfectly are not reported.  For protein-coding genes, when the protein changes, ACE reports information about the change via additional elements under the status element:

   (status
      mapped
      (protein-differs
         (percent-match 16.56 477/2881 ref=2881 alt=770))
      (frameshift
         (nt-with-phase-mismatch 6424)
         (percent-phase-mismatch 74.3%)))

In this case, mapped indicates that all splice sites remain valid, but the protein-differs indicates that the protein is not identical to the reference protein.  The percent-match (number of exactly matching amino acids, divided by the length of the reference protein) is very small due to a frameshift affecting 6424 nucleotides in the original coding segment; 74.3% is the proportion of the reference coding segment that is out of phase in the mapped gene structure.
     A premature-stop element indicates that the stop codon is earlier in the mapped sequence than in the reference sequence:

   (status
      mapped
      (premature-stop NMD
         (EJC-distance 6019)


In this case ACE predicts nonsense-mediated decay (NMD) (the other possible outcome of a premature stop codon is protein-truncation).  NMD is generally assumed to occur if a premature stop codon occurs at least 50 nt upstream of the last exon junction, where distance is measured on the spliced transcript.  ACE reports the observed EJC-distance to enable further downstream filtering, if desired.
     When the 5' UTR has changed due to sequence variants, ACE scans upstream of the annotated start codon for a new upstream start codon scoring above the start-codon PWM threshold.  If it finds one, it reports that fact, along with the sequences and scores of both the new upstream start codon and the original downstream start codon, and it provides a reason for predicting the new start:

   (status
      mapped
      (new-upstream-start-codon
         (new-start ggaagc_ATG_gct -9.89305 cutoff: -14.0648)
         (old-start cagctg_ATG_gac -13.6501 cutoff: -14.0648)
         (reference bad-consensus)
         (ORF-length 1173 => 1230)
         (transcript...

In this case the reason for predicting the new start was that the new start codon did not exist in the reference sequence (i.e., the reference did not have a valid start-codon consensus sequence).  The other possible reason for predicting a new start is that the start codon existed in the reference, but the flanking bases were "weak", meaning that the sequence scored below the PWM threshold in the reference sequence but above the PWM threshold in the alternate sequence.  ACE also reports the difference in ORF length between the downstream ORF and the new upstream ORF.  It then provides a revised transcript structure based on the new ORF.  If the new transcript structure is predicted to cause NMD or protein truncation, that is reported inside the transcript element.  Scans for upstream start codons also occur for predicted novel splice forms upon disruption of a splice site (see below).
     Note that the original, mapped transcript structure is still provided, further down in the report (not shown in the example above) as the mapped-transcript.  Upstream start codons can denote the beginning of a uORF, or upstream open reading frame.  Many human genes have uORFs in addition to their primary annotated ORF, and while these uORFs typically have stop codons early in the transcript, these do not always induce degradation via NMD.  For that reason, when a uORF is detected and NMD is predicted according to the 50 nt rule for downstream ORFs, ACE will report this as hypothetical-NMD to denote that it is much less confident that NMD will actually occur.
     When a splice site in the mapped transcript is invalid (according to the consensus sequences specified in the configuration file), ACE reports the disrupted splice site and generates alternative transcript structures, if any are feasible:

   (status
      splicing-changes
      (broken-donor 10891 ccagca_GC_gtctgccctg -27.2988
         ccagca_GG_gtctgccctg -31.4097 threshold: -21.591)

      (alternate-structures
         (transcript
            (structure-change exon-skipping)
            (ID ENST00000531190)
            (gene ENSG00000110455)
            (type noncoding)
            ...
 
ACE reports the mapped location of the splice site, the consensus and flanking sequences in both the reference and the individualized genomes, the PWM scores for those sequences, and the PWM threshold that was applied.  In this case a disrupted donor splice site is detected, and ACE predicts that the exon will be skipped.  ACE also searches within a fixed distance (default: 70 nt) of the original site for a cryptic site, and if it finds one, ACE reports a cryptic-site element:

   (cryptic-site acceptor 8498 tcatctctgagggcccttac_AG_g
      -26.875 threshold: -32.1488)


     ACE reports all possible alternate structures that may result when an annotated splice site is disrupted; when multiple splice sites are disrupted, ACE forms all combinations of alternatives for the disrupted sites via Cartesian project.  In practice, there are typically very few alternative structures.  For each alternate gene structure generated by ACE, a fate element indicates the predicted outcome of splicing the transcript in that way:

   (fate noncoding)

In this case, the skipping of the first coding exon has removed the start codon, and as ACE was unable to find a downstream start codon with a sufficiently strong Kozak sequence (as evaluated via a retrainable positional weight matrix), ACE predicted that the modified transcript is now noncoding.  Note that when a disrupted splice site occurs in the first or last exon of the transcript, exon skipping is not possible; if intron-retention is enabled in the configuration file, ACE will generate an alternative transcript with an intron retention event. Otherwise, ACE will indicate that transcription is likely aborted due to an unresolvable splicing outcome from the disrupted splice site:

   (status
      no-transcript
      (broken-donor 2716 GA))


ACE detects not only splice sites that have invalid consensus sequences, but also sites that have become weakened due to variants in nucleotides flanking the consensus, using a positional weight matrix.  ACE reports only sites that have weakened to the extent that the reference site scored above the matrix threshold and the new site scores below the threshold, and the probability of the new site is less than half that of the reference site.  Scores are shown on a natural log scale:

   (weakened-donor 37043 cctggg_GT_aaggggccat -19.5229
                         cctggg_AT_aaggggccat -23.887
                         threshold: -21.591)


In this case, a donor splice site at position 37043 was found to be weakened via a change to the consensus sequence (in this case the new consensus is still valid, as it still matches a consensus listed in the configuration file).  The weight matrix score decreased from -19.5 in the reference sequence to -23.9 in the individualized sequence; the score has thus changed from being above the threshold (-21.6) to being below it, and may therefore indicate likely reduction in splicing activity at this site.
     If ACE fails to find a valid start codon, it reports a no-start-codon condition:

   (status
      mapped
      no-start-codon)

If ACE fails to find an in-frame stop codon, it reports that the transcript will likely be degraded via nonstop decay:

   (status
      mapped
      nonstop-decay)

For transcripts annotated as noncoding, ACE searches for an open reading frame of sufficient length, as set by the min-orf-length parameter in the configuration file.  If an ORF is found, ACE also scans the reference transcript to see if the ORF existed in the reference.  If the ORF did not exist in the reference, a flag of ref-no-start-codon is reported.  If the ORF existed in the reference but was shorter than min-orf-length, and if the new ORF is at least as long as min-orf-length and is at least twice as long as the ORF in the reference sequence, then the new ORF is marked via the flag ref-ORF-too-short.  If the ORF is long enough in both the reference and alternate sequences, ACE reports possible-misannotation, as the reference may be incorrectly annotated as noncoding.  Note that these ORFs are required to begin with a valid start codon, but they may lack a valid stop codon; this is to allow for the possibility that the annotation is incomplete at the 3' end.
     Variants occurring within any transcript, whether the reference transcript, the mapped transcript, or putative alternate transcripts, are reported via the variants element.  These are then grouped according to the context of the variant, so that variants within a coding segment (CDS), within UTR, within a splice site, or near a splice site (within 50 nt) are reported together.  Other than variants near a splice site, intronic variants are not currently reported in the Essex output, though they are present on the defline of the generated FASTA file.  For each variant, the variant identifier, original substrate (chromosome), reference location, mapped location, reference allele, and alternate (applied) allele are given:

(CDS-variants rs29674:chr5:23887:23887:A:C ...
(UTR-variants rs4956990:chr5:43761:43762:A:G ...
(near-splice-variants rs10794716:chr10:47730:47717:T:C ...
(splice-site-variants rs10853954:chr19:10969:10937:T:C ...

Note that the classifications of variants (CDS, UTR, etc.) may differ between the transcripts of a gene, as differences in gene structure may change the context of a given variant.
      The complete structure of the Essex output produced by ACE is described by this schema, which lists all permissible parent and child relationships for pairs of legal Essex tags.

Parsing the Structured Output

The following Perl modules (found in the directory where you cloned git@github.com:bmajoros/perl.git) are provided for parsing the structured Essex output.  Alternatively, you may wish to parse the XML output using a publicly-available XML parser; the XML and Essex outputs contain the same information.  Converting to XML is described below (using the essex-to-xml.pl script in the $ACE directory).

UPDATE: Essex parsers are now available in Perl, C++, and Python.  The C++ parser is included in the main ACE repository.  The Python parser is in beta testing, and can be downloaded here: https://github.com/bmajoros/python.


    The following example script illustrates the use of this library for converting an Essex file to GFF:

use EssexParser;
use EssexACE;
use Transcript;
my $parser=new EssexParser($infile);

while(1) {
  my $root=$parser->nextElem();
  last unless $root;
  my $ace=new EssexACE($root);
  my $transcriptID=$ace->getTranscriptID();
  my $status=$ace->getStatusString();
  if($status eq "mapped") {
    my $transcript=$ace->getMappedTranscript();
    print $transcript->toGff();
  }
  elsif($status eq "splicing-changes") {
    my $transcripts=$ace->getAltTranscripts();
    my $n=@$transcripts;
    for(my $i=0 ; $i<$n ; ++$i) {
      my $transcript=$transcripts->[$i];
      my $id=$transcript->getTranscriptId();
      $id="ALT$i\_$id";
      $transcript->setTranscriptId($id);
      print $transcript->toGff();
    }
  }
}

The EssexParser package parses an Essex file and returns a series of tree data structures representing successive elements in the file:

package EssexParser;
#   $parser=new EssexParser($filename);
#   $parser->close();
#   $tree=$parser->nextElem();   # returns root of the tree
#   $forest=$parser->parseAll(); # returns an array of trees

The tree data structures returned by the EssexParser are composed of EssexNode objects, which can be traversed or queried in various ways:

package EssexNode;
# $tag=$node->getTag();
# $n=$node->numElements();
# $elem=$node->getIthElem($i);
# $elem=$node->findChild($tag);
# $array=$node->findChildren($tag);
# $array=$node->findDescendents($tag);# always returns an array
# $elem=$node->findDescendent($tag); # returns node or undef
# $bool=$node->hasDescendentOrDatum($tagOrDatum);
# $count=$node->countDescendentOrDatum($tagOrDatum);
# $n=$node->countDescendents($tag);
# $bool=$node->hasDescendent($tag);
# $string=$node->getAttribute($attributeTag);
# $array=$node->getElements();
# $bool=EssexNode::isaNode($datum);
# $bool=$node->hasCompositeChildren();
# $node->print($filehandle);
# $node->printXML($filehandle);
# $array=$node->pathQuery("report/reference-transcript/type");

While the EssexParser provides generic parsing capabilities for any Essex file (whether produced by ACE or some other application), the EssexACE module provides a more detailed representation and querying facilities for Essex files produced by ACE:

package EssexACE;
# $aceReport=new EssexACE($essexReportElem);
# $aceReport->changeMinPercentMatch($x); # example: 75.3
# $substrate=$aceReport->getSubstrate();
# $transcriptID=$aceReport->getTranscriptID();
# $geneID=$aceReport->getGeneID();
# $vcfWarnings=$aceReport->getNumVcfWarnings();
# $vcfErrors=$aceReport->getNumVcfErrors();
# $cigar=$aceReport->getCigar();
# $transcript=$aceReport->getRefTranscript();
# $transcript=$aceReport->getMappedTranscript();
# $statusString=$aceReport->getStatusString();
#   status = mapped/splicing-changes/no-transcript/bad-annotation
# $bool=$aceReport->hasBrokenSpliceSite();
# $array=$report->getBrokenSpliceSites(); [pos,type=GT/AG]
# $array=$aceReport->getAltTranscripts();
# $bool=$aceReport->proteinDiffers();
# $percent=$aceReport->getProteinMatch(); # example: 98.57 (whole number)
# $bool=$aceReport->frameshift();
# $percent=$aceReport->frameshiftPercentMismatch();
#    example: 83 (whole num)

# $nucs=$aceReport->frameshiftNucMismatch();
# $bool=$aceReport->mappedPTC();
#         premature stop codon when status="mapped"

# $bool=$aceReport->mappedNMD(50); # only valid when status="mapped"
# $bool=$aceReport->mappedNoStart(); # assumes status="mappped"
# $bool=$aceReport->mappedNonstop(); # no stop codon; status="mapped"
# $bool=$aceReport->refIsCoding();
# $bool=$aceReport->mappedIsCoding();
# $bool=$aceReport->lossOfCoding(); # ref is coding, alt is noncoding
# $bool=$aceReport->allAltStructuresLOF();# status=splicing-changes
# $bool=$aceReport->allExonSkippingLOF(); # status=splicing-changes
# $bool=$aceReport->allExonSkippingNMD(); # status=splicing-changes

Any reference-transcript, mapped-transcript, or transcript element in an Essex file can be converted into a Transcript object that can then be queried in various ways specific to transcripts:

package Transcript;
# $transcript=new Transcript($essexNode);
# $rawExons=$transcript->getRawExons();
# $success=$transcript->loadExonSequences(\$chromosome);
# $seq=$transcript->loadTranscriptSeq(\$chromosome);
# $bool=$transcript->equals($other);
# $bool=$transcript->overlaps($otherTranscript);
# $bool=$transcript->overlaps($begin,$end);
# $len=$transcript->getLength(); # sum of exon sizes
# $len=$transcript->getExtent(); # end-begin
# ($begin,$end)=$transcript->getCDSbeginEnd();
# $n=$transcript->numExons();
# $exon=$transcript->getIthExon($i);
# $n=$transcript->numUTR();
# $utr=$transcript->getIthUTR($i);
# $transcript->getSubstrate();
# $transcript->getSource();
# $gff=$transcript->toGff();
# $id=$transcript->getID();
# $id=$transcript->getTranscriptId();
# $id=$transcript->getGeneId();
# $begin=$transcript->getBegin();
# $end=$transcript->getEnd();
# $strand=$transcript->getStrand();
# if($transcript->isWellFormed(\$sequence));
# $transcript->getScore();
# my $nextExon=$transcript->getSuccessor($thisExon);
# $transcript->shiftCoords($delta);
# $transcript->reverseComplement($seqLen);
# $transcript->setStopCodons({TGA=>1,TAA=>1,TAG=>1});
# $g=$transcript->getGene();
# $genomicCoord=$transcript->mapToGenome($transcriptCoord);
# $transcriptCoord=$transcript->mapToTranscript($genomicCoord);
# $exon=$transcript->getExonContaining($genomicCoord);
# $array=$transcript->getSpliceSites();
# $array=$transcript->parseExtraFields(); # array of [key,value] pairs
# $hash=$transcript->hashExtraFields(\@keyValuePairs);

Transcript objects are composed of Exon objects, shown below.  Note that both CDS (coding segment) and UTR exons are represented, separately, as Exon objects.  The Transcript::getRawExons() function coalesces these together into exons regardless of coding status.

package Exon;
# $exon->containsCoordinate($x) : boolean
# $length=$exon->getLength();
# $exon->reverseComplement($seqLen);
# $strand=$exon->getStrand(); # "+" or "-"
# $bool=$exon->overlaps($otherExon)
# $sequence=$exon->getSequence();
# $transcript=$exon->getTranscript();
# $gff=$exon->toGff();
# $begin=$exon->getBegin();
# $end=$exon->getEnd();
# $frame=$exon->getFrame();
# $type=$exon->getType();
# $score=$exon->getScore();
# $exon->shiftCoords($delta);
# $substrate=$exon->getSubstrate();


Utilities

While parsing the output of ACE directly using custom scripts and the above parsing modules (or via an XML parser) is the most powerful way to perform complex queries and analyses of ACE output, simple analyses can often be performed using the utility scripts provided with ACE.  You can find these in your $ACE directory.

essex-to-xml.pl <in.essex>

This script converts an Essex file to XML, so that standard XML tools can be applied.  All information in the Essex file is retained in the XML.

Example: $ACE/essex-to-xml.pl 1.essex > 1.xml

essex-to-tsv.pl <in.essex>

This script converts an Essex file to a Tab-Separated Value (TSV) file for easy parsing or reading into programs such as Excel.  Note that because Essex is a highly structured format, not all information can be represented conveniently in TSV format.

Example: $ACE/essex-to-tsv.pl 1.essex > 1.tsv

This produces a tab-delimited text file with columns for gene and transcript identifiers, coordinates, and binary (0/1) indicators for the status of the transcript in terms of various types of loss-of-function events:

GeneID  TranscriptID    RefSubstrate    AltSubstrate    VcfWarnings     VcfErrors       Begin   End     Strand  Type    Mapped  SplicingChanges BrokenSpliceSite        ProteinDiffers  Frameshift      PrematureStop   NMD     StartCodonChange        NoStartCodon    NoTranscript    NonstopDecay

ENSG00000272636.1       ENST00000343572.7       chr17   ENSG00000272636.1_1     0       0       1000    26618   -
       protein-coding  1       0       0       0       0       0       0       0       0       0       0

ENSG00000272636.1       ENST00000609727.1       chr17   ENSG00000272636.1_1     0       0       8521    26152   -
       protein-coding  0       0       0       0       0       1       0       0       0       0       0

...etc...


essex-to-gff.pl <in.essex> <out.gff>

This program extracts transcript and exon coordinates from an Essex file and outputs them in GTF format (GFF2) for use in external applications, such as an RNA-seq quantification pipeline.  Both mapped transcripts (with no splicing changes) and predicted alternate transcripts are extracted.  Predicted alternate splice forms are given an identifier with "ALT" prefixed.

Example:  $ACE/essex-to-gff.pl 1.essex out.gff

essex-dump-status.pl <in.essex>

This script prints a tab-delimited list of transcripts and their mapping status.  Mapped transcripts are those that mapped perfectly with no changes to the gene structure.  Other possible status values indicate changes to splicing patterns, changes to the reading frame, changes to the protein, or errors in the annotation, as illutrated below:

Example: $ACE/essex-dump-status.pl 1.essex

ENSG00000272636.1_1  0/0  ENST00000343572.7  mapped
ENSG00000272636.1_1  0/0  ENST00000609727.1  bad-annotation bad-start premature-stop
ENSG00000174837.10_1 0/0  ENST00000450315.3  splicing-changes broken-donor
ENSG00000184731.5_1  0/0  ENST00000327669.4  mapped protein-differs
...etc...
The 0/0 column indicates the number of VCF warnings followed by the number of VCF errors.  Warnings indicate that overlapping variants were called but were able to be resolved because they were consistent (one subsumed the others).  Errors indicate that overlapping variants could not be resolved because they were inconsistent (for example: an insertion inside a deletion; two insertions at the same reference location, two incompatible SNP alleles; etc.).

essex-filter-errors.pl <in.essex>

This script reads an Essex file and prints to STDOUT any report that does not contain a bad-annotation flag or a too-many-vcf-errors flag.  Bad annotations include reference annotations having nonconsensus splice sites not listed in the configuration file, lacking a start or stop codon, or being predicted to suffer NMD (nonsense-mediated decay). 

essex-get-gene.pl <in.essex> <geneID>

This script simply prints out any Essex report having the given gene ID.

Example: $ACE/essex-get-gene.pl 1.essex ENSG00000227232.4

essex-get-transcript.pl <in.essex> <transcriptID>

Similarly to the previous script, this script prints out any Essex report having the given transcript ID.

Example: $ACE/essex-get-transcript.pl 1.essex
            ENST00000538476.1


essex-get-loss-of-function.pl <in.essex> <min-percent-match>

This script detects a number of different classes of putative loss of function events reported in an Essex file and reports them for each transcript.  The command line takes an Essex file and a minimum percent identity (between 0 and 100) for proteins to be considered to retain function.

Example: $ACE/essex-get-loss-of-function.pl 1.essex 50

The third column of the output gives the loss-of-function category, and the fourth column gives the cause for the predicted loss:

ENSG00000060237 ENST00000530271 NMD frameshift
ENSG00000145506 ENST00000382730 NMD sequence-variant
ENSG00000096996 ENST00000600835 NMD broken-acceptor
ENSG00000032389 ENST00000443925 protein-differs sequence-variant
ENSG00000154035 ENST00000468196 protein-differs frameshift
ENSG00000145494 ENST00000469176 nonstop-decay sequence-variant
ENSG00000171987 ENST00000307616 nonstop-decay frameshift

essex-summarize-status.pl <in.essex>

This script parses an Essex file and tabulates the number of times each status is given for any transcript; this provides a convenient way to get statistics on specific types of changes to gene structures, such as splicing changes, changes to proteins, and frameshifts.

Example: $ACE/essex-summarize-status.pl 1.essex

no-transcript   3
        broken-acceptor 1
        broken-donor    2
splicing-changes        51
        alternate-structures    51
        broken-acceptor 26
        broken-donor    25
mapped  12671
        nonstop-decay   13
        noncoding       3383
        start-codon-change      38
        protein-differs 9170
        premature-stop  173
        new-upstream-start-codon        46
        noncoding-to-coding     3362
        weakened-donor  68
        weakened-acceptor       103
        frameshift      159

essex-tabulate-changes.pl <in.essex>

This script is similar to the previous script in that it tabulates changes to gene structures, but reports the changes in a different format, and with slightly more information. 

Example: $ACE/essex-tabulate-changes.pl 1.essex

GENES_PROTEIN_CHANGE    4060
GENES_CHANGE_START_CODON        23
CODING_TRANSCRIPTS_NO_START_CODON       0
GENES_FRAMESHIFT        107     14667   3073227 166
GENES_NONSTOP_DECAY     11
GENES_PTC       143
MAPPED_GENES_NMD        95
NONCODING_GENES_SPLICING_CHANGE 0
CODING_GENES_SPLICING_CHANGE    36
GENES_SPLICING_CHANGE_NMD       14
BROKEN_DONORS   27
BROKEN_ACCEPTORS        27
CRYPTIC_SITES   182     1068    51
EXON_SKIPPING   43      43      51
ALT_TRANSCRIPTS 3542    3870    3459
WEAKENED_DONORS 68
WEAKENED_ACCEPTORS      103
GENES_NO_TRANSCRIPT_BROKEN_SS   3
GENES_NO_TRANSCRIPT_WEAK_SS     0

For GENES_FRAMESHIFT, the columns are the number of genes having a frameshift, the total frameshift length, the total length squared (for computing variances), and the total number of frameshifts.  For CRYPTIC_SITES, the columns are the number of cryptic sites predicted, the same value squared, and the number of disrupted splice sites.  For EXON_SKIPPING, the columns are the number of exon skipping events predicted, the same value squared, and the number of disrupted splice sites.  For ALT_TRANSCRIPTS, the columns are the total number of alternate splice forms predicted, the same value squared, and the number of genes having any alternate splice forms predicted.


essex-query.pl <in.essex> <tag>

This program reports all elements having a given tag.

Example: essex-query.pl in.essex weakened-donor


(weakened-donor 8548 aac_GT_aagttttgag -15.362 
   aac_GT_taagttttga -19.0666)

(weakened-donor 1056 tttatg_GT_aagtggatat -18.6239
   tttatg_GT_gtccagtcgc -22.6956)

(weakened-donor 6130 ttctgg_GT_aggagtctgc -20.6655
   ttctgg_GT_aggagtcttt -21.6978)


essex-path-query.pl <in.essex> <query>

This program is similar to essex-query.pl except that the query is a path query, in which the path to an element is specified.  This allows more specific queries that indicate elements having a given path from the root.  Paths are specified similarly to UNIX file paths, except that rather than directory names, a path query contains Essex nodes in a nested relation (parent-child).

Example: $ACE/essex-path-query.pl 1.essex
   report/status/alternate-structures/transcript


The result of this example would be a set of transcript elements that are nested within an alternate-structures element (meaning that they are predicted gene structures resulting from a splicing-disruption variant).


filter-gencode.pl <in.gff>

This program is provided for parsing and filtering GENCODE gene annotations for use in ACE.  It filters out all but known genes that are either protein-coding genes or lincRNAs.

Example: $ACE/filter-gencode.pl infile.gff > outfile.gff

normalize-gff.pl <infile.gff>

This program ensures that the type (coding versus noncoding) of each gene in a GFF2/GTF file is properly specified.

Example: $ACE/normalize-gff.pl infile.gff > outfile.gff

subset-vcf-by-sample <in.vcf> <sampleID> <out.vcf>

This program extracts a single sample from a VCF file, allowing ACE to be run on that one sample rather than all the samples.  The input and output VCF files may be bgzipped.

Example: $ACE/subset-vcf-by-sample chr1.vcf.gz HG00099 out.vcf

sort-vcf.pl <in.vcf.gz> <out.vcf.gz> <temp-dir>

This program position-sorts a VCF file so that it can be indexed by tabix.  A temp directory must be provided for temporary files used by the UNIX sort command.


Contact

Please direct all bug reports and feature requests to Bill Majoros.

ACE was developed by Bill Majoros in the Reddy Lab at Duke University in collaboration with members of the Yandell Lab at the University of Utah and Michael Campbell at Cold Spring Harbor Lab.  Funding for ACE was provided through grants R01 DK099820 to Tim Reddy, R01 DK097534 to Tim Reddy,
NSF IOS-1126998 to Mark Yandell, and NSF EAGER 1561337 to Mark Yandell.






contact: bmajoros@duke.edu

Hummingbird photo by Bill Majoros.  Used with permission.