bjmt/bwtk

install with bioconda

bwtk: A bigWig Toolkit

bwtk is a program containing a set of utilities for handling bigWig files, as well as converting to and from the bedGraph format. bigWig files are indexed binary files which allow fast read access to different parts of the file, making them ideal for genome browsers and for quickly extracting data from specific chromosomes or subsequences. The format has some limitations, however, such as only supporting genomes in which no individual chromosome is larger than the maximum value of uint32_t (4,294,967,295, or about 4.29 billion).

bwtk is built upon the libBigWig library to read and write bigWigs. One note of caution regarding bwtk's usage of this library: during the final stages of creating a new bigWig, an indexing step occurs which can require a large amount of memory (on the order of hundreds of MBs, perhaps even GBs for larger files). Therefore, expect any subcommand which creates new bigWig files to use a substantial amount of memory. The remaining subcommands are lightweight and fast, needing only a few MBs of memory.
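To make the chromosome size limit concrete, here is a minimal Python sketch that scans the contents of a two-column, tab-separated chrom.sizes file and reports any chromosome too long to fit in a uint32_t. The function name `oversized_chroms` and the `hypothetical_giant` entry are made up for illustration; this is not part of bwtk.

```python
# Largest chromosome length representable in an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1  # 4,294,967,295


def oversized_chroms(chrom_sizes_text):
    """Return (name, size) pairs whose size exceeds UINT32_MAX.

    Assumes the usual chrom.sizes layout: name<TAB>size, one per line.
    """
    too_big = []
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, size = line.split("\t")[:2]
        if int(size) > UINT32_MAX:
            too_big.append((name, int(size)))
    return too_big


# Two real Arabidopsis chromosome sizes plus one invented oversized entry.
example = "1\t30427671\n2\t19698289\nhypothetical_giant\t5000000000\n"
print(oversized_chroms(example))
```

An empty result means every chromosome fits within the limit.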

bwtk mostly reimplements some of the existing functionality from the original UCSC tools, with a few additions and improvements:

  • bedGraph files can be gzipped when converting to bigWig
  • Multiple bigWigs can be averaged/summed/min'd/max'd together
  • Retrieval of single-base resolution data from BED ranges
  • Subsetting of bigWigs
  • Value operations such as addition, multiplication, log10 transformation
  • Binning of bigWig values and collapsing of any resulting sequential ranges with identical values, leading to substantially smaller file sizes depending on the bin step value
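The last point is where the file size savings come from, and a small sketch may help show why. The Python below rounds each value to the nearest multiple of a bin step and then merges adjacent ranges that end up with the same value. It illustrates the general idea only: the function name is hypothetical, and the exact rounding rule bwtk applies is an assumption here.

```python
def bin_and_collapse(intervals, step):
    """Round values to the nearest multiple of `step`, then merge
    contiguous intervals that share the same binned value.

    `intervals` is a list of (start, end, value) tuples for one
    chromosome, sorted by start and half-open like bedGraph.
    Rounding uses Python's round(); bwtk's own rule may differ.
    """
    merged = []
    for start, end, value in intervals:
        binned = round(value / step) * step
        if merged and merged[-1][1] == start and merged[-1][2] == binned:
            # Touches the previous interval and has the same value: extend it.
            merged[-1] = (merged[-1][0], end, binned)
        else:
            merged.append((start, end, binned))
    return merged


track = [(0, 10, 0.9), (10, 20, 1.2), (20, 30, 1.4), (30, 40, 2.6), (50, 60, 2.8)]
for row in bin_and_collapse(track, 1):
    print(row)
```

With a step of 1, the five input ranges collapse to three. The last two stay separate even though they share a binned value, because there is a gap between them. A smaller step preserves more distinct values and therefore merges fewer ranges.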

Installation

bwtk has two dependencies: Zlib and libBigWig. There are two ways to link these into bwtk: build the pre-bundled versions of the dependencies (in libs/) and statically link them into bwtk during compilation, or dynamically link bwtk against system-wide versions of the dependencies.

# Install by statically linking the pre-bundled dependencies:
make libz libBigWig
make release

# Install by dynamically linking system-wide dependencies:
make z_dyn=1 bw_dyn=1 release

Please note that the pre-bundled libBigWig comes with curl functionality disabled. If you need to be able to run bwtk on remote files, please link your own version of libBigWig dynamically.

Quick start

File transformations

There are three subcommands relating to creating bigWigs:

  • bg2bw: one bedGraph -> [operation] -> one bigWig
  • adjust: one bigWig -> [subset] -> [operation] -> one bigWig
  • merge: Multiple bigWigs -> [merge] -> [operation] -> one bigWig

The only differences between adjust and merge are that adjust works on a single input file and allows for subsetting the output, and merge works on more than one input file but has no subsetting option.
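Conceptually, the merge step combines the values of several tracks position by position. The Python sketch below shows this on plain per-base lists. The names `OPS` and `merge_tracks` are hypothetical, and how bwtk treats positions that are missing from some inputs is not covered; equal-length, fully covered tracks are assumed.

```python
from statistics import mean

# The four combining operations named in the feature list.
OPS = {"mean": mean, "sum": sum, "min": min, "max": max}


def merge_tracks(tracks, op):
    """Combine equal-length per-base value lists position by position."""
    combine = OPS[op]
    return [combine(values) for values in zip(*tracks)]


rep1 = [1.0, 2.0, 4.0]
rep2 = [3.0, 2.0, 0.0]
print(merge_tracks([rep1, rep2], "mean"))  # [2.0, 2.0, 2.0]
print(merge_tracks([rep1, rep2], "max"))   # [3.0, 2.0, 4.0]
```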

Data extraction

There are also three subcommands which create non-bigWig outputs:

  • values: Extract single-base resolution range scores
  • score: Calculate summary information of range scores
  • chroms: Retrieve chromosome names and sizes

Example usage: converting a bedGraph and reducing the file size

Let us consider a theoretical Arabidopsis ATAC-seq sample, starting from a compressed bedGraph:

$ du -h atac.bedGraph.gz
382M    atac.bedGraph.gz

To start, we can first convert to bigWig.

$ bwtk bg2bw -i atac.bedGraph.gz -g chrom.sizes -o atac.bw
$ du -h atac*
382M    atac.bedGraph.gz
328M    atac.bw

382 MB is a fairly typical size for an Arabidopsis genome-wide compressed bedGraph file, if perhaps on the larger side. Converting to bigWig doesn't save much space, but it does make loading into IGV and other applications much faster. However, there is a trick we can use to substantially reduce the file size: binning.

First, let's get an idea of the basic properties of the data by getting some chromosome-level statistics:

$ bwtk score -i atac.bw -o-
name	size	covered	sum	mean0	mean	min	max
1	30427671	28792570	3.97622e+07	1.30678	1.38099	0.006472	43.6019
2	19698289	19260767	2.35234e+07	1.19419	1.22131	0.006472	44.2038
3	23459830	22963063	2.96944e+07	1.26576	1.29314	0.006472	40.5665
4	18585056	18154827	2.34319e+07	1.26079	1.29067	0.006472	40.7154
5	26975502	26081099	3.45877e+07	1.28219	1.32616	0.006472	42.9223
Mt	366924	0	0	0	0	0	0
Pt	154478	0	0	0	0	0	0
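The difference between the mean0 and mean columns is not spelled out here, but the numbers above are consistent with mean0 being the sum divided by the full chromosome size and mean being the sum divided by the number of covered bases. The short Python check below recomputes both from the printed columns; the interpretation is inferred from the table, and the helper name `means` is hypothetical.

```python
def means(size, covered, total):
    """Return (mean0, mean) under the assumed definitions:
    mean0 = sum / size, mean = sum / covered."""
    return total / size, total / covered


rows = [
    # name, size, covered, sum (as printed in the table above)
    ("1", 30427671, 28792570, 3.97622e+07),
    ("2", 19698289, 19260767, 2.35234e+07),
]
for name, size, covered, total in rows:
    mean0, mean = means(size, covered, total)
    print(f"{name}\tmean0={mean0:.4f}\tmean={mean:.4f}")
```

Under this reading, mean0 treats uncovered bases as zeros, which is why it is always the smaller of the two.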

From these results, we can make a couple of guesses: the background signal is likely around 1, and the values in peak regions range from 1 to 40. We can first try using a very rough binning, such as rounding to the nearest integer:

$ bwtk adjust -i atac.bw -s 1 -o atac.s1.bw
$ du -h atac*
382M    atac.bedGraph.gz
328M    atac.bw
8.7M    atac.s1.bw

As you can see, this gives us an impressive 37X size reduction over the original bigWig! However, rounding to the nearest integer may not give us sufficient resolution. Before we check how it looks in a genome browser, we can try a couple of smaller bin sizes:

$ bwtk adjust -i atac.bw -s 0.5 -o atac.s05.bw
$ bwtk adjust -i atac.bw -s 0.1 -o atac.s01.bw
$ du -h atac*
382M    atac.bedGraph.gz
328M    atac.bw
59M     atac.s01.bw
14M     atac.s05.bw
8.7M    atac.s1.bw
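The size reductions quoted in this walkthrough follow directly from the du output. As a quick sanity check, this Python snippet divides the original bigWig size by each binned size; the helper name `reduction` is made up for illustration.

```python
def reduction(original_mb, binned_mb):
    """Fold reduction in file size relative to the original."""
    return original_mb / binned_mb


original_mb = 328.0  # atac.bw
binned_mb = {"1": 8.7, "0.5": 14.0, "0.1": 59.0}  # sizes from du, keyed by bin size
for step, size in binned_mb.items():
    print(f"bin size {step}: {reduction(original_mb, size):.1f}X smaller")
```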

Unsurprisingly, smaller bin sizes give us larger bigWigs. Still, we get a more than 5X size reduction even at our smallest bin size. Let's compare how the tracks look in a genome browser:

Zooming out to a 100 kb view, we can see that all of the peaks are still distinct at all bin sizes. However, there is clear, strong blockiness at the coarsest bin size of 1, which is rather visually unappealing. Things look a bit better at 0.5. Let's try zooming in a bit more:

The bin size of 0.5 still looks quite good considering the 23X size reduction! However, there is a little bit of blockiness in the peak shapes. The bin size of 0.1 on the other hand is still nearly indistinguishable from the original track. Let's zoom in once more:

Now, in this very close-up view of some very small peaks, we can finally see the blockiness effect of the smallest bin size, though we can still see all of the peak shape details! The decision of which bin size to use will depend on the distribution of values in the bigWig, as well as the scale at which we would like to visualize the data.