I'll add more to this file when I have time, but the following are some
important tips.

- Since the program is fast, the best strategy when new trimming situations
  come up is trial and error. Running the whole fastq file does not take
  much time. For an instant response, you can test on a subset of your input
  sequences, for example the first 250 reads (1000 lines):

    > zcat fastq.gz | head -1000 > test.fq

  or a randomly selected subset of 250 reads. Each fastq record spans 4
  lines, so the records must be kept together when shuffling:

    > zcat fastq.gz | paste - - - - | shuf -n 250 | tr '\t' '\n' > test.fq

  Then use test.fq for testing.

- Pay attention to the lengths of the adapters. The longer the adapters,
  the more mismatches can be allowed. These values are controlled by the
  "-u" and "-v" options. For example, allowing 4 mismatches for adapters of
  length 6 does not make sense.

- Pay attention also to the locations of the adapters in your reads. The
  options "-b" and "-e" can be used to trim adapters at different
  locations.

- The cutoff threshold of the average quality score inside the moving window
  (option "-a") and the size of the window (option "-w") should also be
  adjusted to meet your needs.

- The program can search for degenerate patterns or wildcard letters.
  * Use [] to include degenerate letters in the "pattern" file specified by
    option "-p". For example, AT[CG]TAC will match either C or G in the
    third position.
  * Use "." as the wildcard that can match anything.
  * Use "^" to negate the letter following it. For example, AT^TGTAC will
    match anything that is not a T in the third position.
  * The degenerate letters can appear multiple times in a pattern, such as
    AT[CG]T[AT]C.

- Due to the parallel nature of the algorithm, the "regular expression" kind
  of search mentioned above does not incur any extra computational cost: the
  search time is the same as for plain patterns such as ATCGTAC.

- We can take advantage of this "regular expression" search in real
  situations. One example is barcode trimming and assignment for Illumina
  sequences. In many Illumina reads the first base is an "N".
  In this case the regular expression search can be used to accept either
  the expected base or "N" at the first position. For example, if you have
  four 6-bp barcodes CGGAAT, CGTGGC, TGCGTA, and TTCTGG, you can set up the
  pattern file as follows (assuming the barcodes are only at the 5'-end):

    [CN]GGAAT    ZZZZZZZ
    [NC]GTGGC    ZZZZZZZ
    [NT]GCGTA    ZZZZZZZ
    [NT]TCTGG    ZZZZZZZ

  This "regular expression" search method is better than the plain pattern
  search. For example, if you ignore the first base and use only the
  remaining 5 bases, you will end up with a pattern file like this:

    GGAAT    ZZZZZZZ
    GTGGC    ZZZZZZZ
    GCGTA    ZZZZZZZ
    TCTGG    ZZZZZZZ

  The 6-base search is better because a 6-base pattern has higher
  specificity than a 5-base pattern. Of course, for short patterns of this
  kind you should also use the "-b" and "-e" options discussed above to
  restrict the range of barcode locations, and don't forget to adjust the
  "-u" and "-v" values.

- For paired-end sequences, refer to "readme.paired_end" on this web site.

- For a specific example of trimming the Illumina AGATCGGAAGAGC adapter, use

    Btrim -w 10 -a 25 -p illumina_adapter.txt -3 -P -o output.fastq -l 40 -t <(gunzip -c path_to_your_fastq/*.gz ) -C -z

  The "pattern" file "illumina_adapter.txt" can be downloaded from this same
  web site. It also serves as an example of the new 6-column "pattern" file
  introduced in version 0.3.0. This new feature provides finer control of
  the maximum errors allowed at both the 5'-end and the 3'-end, and control
  of the regions where the adapter is expected in the sequence (for both the
  5'-end adapter and the 3'-end adapter). These controls are specified by
  extra columns in the "pattern" file given by the "-p" option. In detail,
  the "pattern" file now accepts 6 tab-delimited columns on each line, in
  the following format:

    5'-adapter  3'-adapter  5'-max-err  3'-max-err  5'-pos  3'-pos

  A negative 3'-pos is counted from the end of the sequence.
  For example, 5'-pos=20 means that the 5'-adapter is searched for only in
  the 0-20 base region, while 3'-pos=-10 means that the 3'-adapter is
  searched for from sequence_length-10 to the end of the sequence. This
  setting can routinely achieve a 97-99% mapping rate.
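
To make the 5'-pos and 3'-pos columns concrete, here is a small Python sketch that computes the region searched for each adapter. It is only an illustration and not Btrim's code: the function name search_windows, the 0-based half-open (start, end) convention, and the clamping to the read length are assumptions made for this example. Only the negative 3'-pos case described above is handled.

```python
def search_windows(seq_len, pos5, pos3):
    """Return ((start5, end5), (start3, end3)) as 0-based, half-open ranges.

    Hypothetical helper for illustration only; it is not part of Btrim.
    """
    if pos3 >= 0:
        # This file only describes negative 3'-pos values, so the sketch
        # refuses anything else instead of guessing.
        raise ValueError("only a negative 3'-pos is covered by this sketch")
    # 5'-adapter: searched from the start of the read up to base pos5
    # (clamped to the read length, which is an assumption).
    window5 = (0, min(pos5, seq_len))
    # 3'-adapter: searched from seq_len - |pos3| to the end of the read.
    window3 = (max(seq_len + pos3, 0), seq_len)
    return window5, window3


if __name__ == "__main__":
    # A 50-base read with 5'-pos=20 and 3'-pos=-10, as in the example above.
    print(search_windows(50, 20, -10))  # ((0, 20), (40, 50))
```

For a read shorter than either setting, both windows simply cover the whole read under these assumptions.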
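
The degenerate-pattern syntax described earlier in this file ([], "." and "^") can be checked by hand before a pattern goes into the "pattern" file. The Python sketch below translates a pattern into a Python regular expression, following the semantics stated in this file. It is an illustration under assumptions, not how Btrim searches internally: the function name to_regex is hypothetical, and the sketch ignores the mismatch allowances set by "-u" and "-v".

```python
import re


def to_regex(pattern):
    """Translate a degenerate pattern into a compiled Python regex.

    Hypothetical helper for illustration only; exact matching, no mismatches.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "^":
            # "^X" matches any single letter except X.
            out.append("[^" + re.escape(pattern[i + 1]) + "]")
            i += 2
        elif ch == "[":
            # "[CG]" matches any one of the listed letters.
            close = pattern.index("]", i)
            out.append("[" + re.escape(pattern[i + 1:close]) + "]")
            i = close + 1
        elif ch == ".":
            # "." matches any single letter.
            out.append(".")
            i += 1
        else:
            # A plain letter matches itself.
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))


if __name__ == "__main__":
    print(bool(to_regex("AT[CG]TAC").fullmatch("ATGTAC")))   # True
    print(bool(to_regex("AT^TGTAC").fullmatch("ATTGTAC")))   # False
    print(bool(to_regex("[CN]GGAAT").fullmatch("NGGAAT")))   # True
```

The last call mirrors the barcode example: a read starting with "N" still matches the 6-base barcode pattern.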