-
btrim
A fast and accurate adapter, barcode, and low-quality region trimming and binning program, written in C, for
next-generation sequencing reads. The search algorithm is based on Eugene Myers' fast bit-vector algorithm.
Reference:
-
Yong Kong (2011)
Btrim: A fast, lightweight adapter and quality trimming program
for next-generation sequencing technologies,
Genomics,
98,
152-153.
[doi]
[pubmed]
[Elsevier]
[arXiv]
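To illustrate the kind of bit-vector search mentioned above, here is a minimal sketch of Myers' algorithm for approximate matching of a short pattern (such as an adapter) against a read. It is not btrim's actual code: the function name `myers_search`, its signature, and the 64-character pattern limit are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not taken from btrim.
 * Returns the 0-based index in `text` at which the first occurrence of
 * `pattern` with at most `max_errors` edits (substitutions, insertions,
 * deletions) ends. Returns -1 if there is no such occurrence, or if the
 * pattern length is outside 1..64 (one bit per pattern position). */
int myers_search(const char *pattern, const char *text, int max_errors)
{
    assert(pattern != NULL && text != NULL);

    size_t m = strlen(pattern);
    if (m == 0 || m > 64)
        return -1;

    /* peq[c] has bit i set when pattern[i] == c. */
    uint64_t peq[256] = {0};
    for (size_t i = 0; i < m; i++)
        peq[(unsigned char)pattern[i]] |= (uint64_t)1 << i;

    uint64_t pv = ~(uint64_t)0;              /* vertical deltas equal to +1 */
    uint64_t mv = 0;                         /* vertical deltas equal to -1 */
    uint64_t top = (uint64_t)1 << (m - 1);   /* bit of the last pattern row */
    int score = (int)m;                      /* distance in the last row    */

    for (size_t j = 0; text[j] != '\0'; j++) {
        uint64_t eq = peq[(unsigned char)text[j]];
        uint64_t xv = eq | mv;
        uint64_t xh = (((eq & pv) + pv) ^ pv) | eq;
        uint64_t ph = mv | ~(xh | pv);       /* horizontal deltas +1 */
        uint64_t mh = pv & xh;               /* horizontal deltas -1 */

        if (ph & top)
            score++;
        else if (mh & top)
            score--;

        /* Shift without setting bit 0: a match may start anywhere in text. */
        ph <<= 1;
        mh <<= 1;
        pv = mh | ~(xv | ph);
        mv = ph & xv;

        if (score <= max_errors)
            return (int)j;
    }
    return -1;
}
```

For example, `myers_search("ACGT", "TTACGTTT", 0)` reports an exact occurrence ending at index 5. Each text character costs a constant number of word operations, which is what makes the approach attractive for scanning many reads.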
-
Genotyping using chromatograms
Low sequencing quality inevitably leads to errors in base calling,
which in turn results in wrong genotypes.
In addition, for heterozygous alleles, which are co-amplified and
sequenced in the same Sanger sequencing reaction, the sequences often
contain ambiguous bases, which usually have a higher error rate in the base calling stage.
The ultimate source of resolution is the
chromatograms from which the sequences are called.
To manually read chromatograms, especially chromatograms of heterozygous
sequences, is laborious and error-prone.
I developed a program
that automatically does genotyping using chromatograms directly.
The program is highly accurate.
An online version of the program is available here.
The program needs two files:
-
a text file that contains the names of
the genotypes and the corresponding sequences,
-
and the chromatogram file.
The program searches every sequence in the first file against the chromatogram file to find
the best match.
The algorithm itself has not been published.
It was used in the following publications:
Reference:
-
Natalie R Powers, John D Eicher, Falk Butter, Yong Kong, Laura L Miller, Susan M Ring, Matthias Mann, Jeffrey R Gruen (2013)
Alleles of a polymorphic ETV6 binding site in DCDC2 confer risk of reading and language impairment,
The American Journal of Human Genetics,
93,
19-28.
[doi]
[pubmed]
[Cell]
-
Natalie R Powers, John D Eicher, Laura L Miller, Yong Kong, Shelley D Smith, Bruce F Pennington, Erik G Willcutt, Richard K Olson, Susan M Ring, Jeffrey R Gruen (2016)
The regulatory element READ1 epistatically influences reading and language, with both deleterious and protective alleles,
Journal of Medical Genetics,
53,
163-171.
[doi]
-
Convert gene symbols to Ensembl IDs
Online program to convert gene symbols to Ensembl IDs
-
Gene symbols to synonyms and aliases
Online program to find synonyms and aliases for a list of gene symbols
-
Pattern search in multiple fasta files with specified error limit
A program written in C that searches for patterns in multiple FASTA files, allowing up to a specified maximum number of errors (edit distance).
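As an illustration of searching with an error limit, the sketch below uses the classic dynamic-programming formulation (Sellers' algorithm) to find the best approximate occurrence of a pattern inside one sequence. The page does not say which algorithm the program uses, so this is an assumption for the example only; the name `best_match` and its interface are hypothetical, and FASTA parsing is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Hypothetical helper, not taken from the program described above.
 * Returns the smallest edit distance between `pattern` and any substring
 * of `seq`. *end_pos receives the 0-based index where the first such best
 * match ends, or -1 if no position improves on deleting the whole pattern.
 * Returns -1 if memory allocation fails. */
int best_match(const char *pattern, const char *seq, int *end_pos)
{
    assert(pattern != NULL && seq != NULL && end_pos != NULL);

    size_t m = strlen(pattern);
    int *col = malloc((m + 1) * sizeof *col);
    if (col == NULL)
        return -1;

    /* Column before any sequence character: distance i to the empty string. */
    for (size_t i = 0; i <= m; i++)
        col[i] = (int)i;

    int best = (int)m;
    *end_pos = -1;

    for (size_t j = 0; seq[j] != '\0'; j++) {
        int diag = col[0];  /* row 0 stays 0: a match may start anywhere */
        for (size_t i = 1; i <= m; i++) {
            int old = col[i];
            int cost = (pattern[i - 1] != seq[j]);
            col[i] = min3(diag + cost,     /* match or substitution      */
                          old + 1,         /* extra character in seq     */
                          col[i - 1] + 1); /* pattern character skipped  */
            diag = old;
        }
        if (col[m] < best) {
            best = col[m];
            *end_pos = (int)j;
        }
    }

    free(col);
    return best;
}
```

To apply an error limit `k`, a caller would accept a sequence when the returned distance is at most `k`, calling the function once per FASTA record.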
-
Maple code for Type III runs
Reference:
-
Yong Kong (2015)
Number of appearances of events in random sequences: a new generating function approach to Type II and Type III runs,
Annals of the Institute of Statistical Mathematics,
69,
489-495.
[doi]
[Springer]
[Maple code]
-
Distributions of positive signals in pyrosequencing
Reference:
-
Length distribution of sequencing by synthesis: fixed flow cycle model
Reference:
-
Yong Kong (2013)
Length distribution of sequencing by synthesis: fixed flow cycle model,
Journal of Mathematical Biology,
67,
389-410.
[doi]
[pubmed]
[Springer]
[arXiv]
-
Calculating complexity of large randomized libraries
Reference:
-
Yong Kong (2009)
Calculating complexity of large randomized libraries,
Journal of Theoretical Biology,
259,
641-645.
[doi]
[pubmed]
[Elsevier]
[arXiv]