HSX Format

Format Specification version 1.0.0, January 12, 2010

TABLE OF CONTENTS

Introduction

HSX is a binary file format for indexing (or listing) DNA sequences in other files, allowing fast random access to those sequences. The format was created as part of the LASTZ project, providing a means to input selected sequences from several short read files into a single run of LASTZ.

This document is provided for users interested in creating HSX files with programs of their own design.

The HSX file contains a sequence index array and an associated hash table. Each sequence index entry includes the sequence's name, length, and a reference to the location of the sequence's data in some other file. This array can be accessed either sequentially or via the hash table. Note that the names in the index file do not have to match the original names or headers in the sequence files.

Sequence entries are ordered by the hashes of their names. Conceptually, the hash table groups the sequences into buckets of sequences with the same hash. For each bucket, the table gives the location in the sequence index of the first sequence in that bucket. Hash collisions are resolved by scanning subsequent index entries for the remaining sequences in the bucket.

There is also a file table, which allows a single index file to cover sequences from multiple sequence files of varying formats. However, currently LASTZ only supports indexing of files in FASTA format.

File Specification

The file is stored in a binary format described by the table below. It can be written on either a big-endian or little-endian machine; programs reading the file determine the byte order of multi-byte fields by examining the magic number at the start of the file.

File OffsetDataMeaning
0x00 D2 52 70 95
—or—
95 70 52 D2
Magic number indicating big-endian byte order.

Magic number indicating little-endian byte order.
0x04 00 00 01 00 File conforms to version 1.0 of the HSX file format.
0x08 00 00 00 1C Header length in bytes, including this field through the SOFF field.
0x0C 00 00 00 xx FLEN: number of entries in the file table (limited to 255).
0x10 xx xx xx xx FOFF: offset (from file start) to the file table.
0x14 xx xx xx xx HLEN: number of buckets in the hash table. This also serves as the modulus for the hash function.

Typically the number of buckets is set so that the average number of sequences per bucket (SLEN/HLEN) is reasonably small (e.g. 10).

The hash table actually includes HLEN+1 buckets. An extra sentinel bucket is appended at the end of the table, containing the offset to just past the end of the sequence index table.

0x18 xx xx xx xx HOFF: offset (from file start) to the hash table.
0x1C xx xx xx xx SLEN: number of entries in the sequence index table.
0x20 xx xx xx xx SOFF: offset (from file start) to the sequence index table.

Entries in the sequence index table are necessarily stored in hash order. Entries with the same hash are stored in alphabetical order (actually, in lexicographic order over the bytes of their names.)

See the hashing description below this table for more information.

FOFF xx xx xx xx FINFO0: offset (from file start) to the info record for the first sequence file (file 0).
Offsets to info records for the remaining FLEN-1 files.
FINFO0 xx xx  FTYPE0: file type for file 0, stored as a length byte FTYPELEN0 followed by FTYPELEN0 bytes of ASCII text.

This is equivalent to a file extension (without a leading .) and will be used as such. In the current implementation, it must be fa or fasta.

Together, this field and the next comprise a single info record in the file table.

FINFO0+1+FTYPELEN0 xx xx  FNAME0: file name for file 0, stored as a length byte FNAMELEN0 followed by FNAMELEN0 bytes of ASCII text.

This is used as the base file name for the corresponding sequence file, including path. However, it is usually an empty string, in which case the base name and path are copied from the name and path of the HSX file itself. This allows files to be renamed without rebuilding the index.

Info records for the remaining FLEN-1 files.
HOFF xx xx xx xx xx SOFFH(0): offset (from file start) into the sequence index table, pointing to the first sequence in the first hash bucket (bucket 0).

SOFFn is the file offset for the n-th entry in the sequence index table. H(k) is the number of sequences that have a hash code less than that of bucket k (i.e. the number of sequences assigned to buckets before bucket k). Therefore SOFFH(k) points to the first sequence in the kth hash bucket.

The most significant bit in a bucket's SOFFH(k) value is used to indicate whether the bucket is empty or not. If a bucket is empty, this bit is set (1), otherwise it is clear (0). The end of the sequences for bucket k can be determined from SOFFH(k+1) (the entry for the start of the next bucket).

Offsets for the first sequences in the remaining HLEN-1 buckets.
HOFF+5*HLEN xx xx xx xx xx Sentinel hash bucket. This contains an offset to the end of the sequence index table (i.e., to the byte just beyond the last entry).
SOFF xx xx xx xx xx IXLEN0: length (in nucleotides) of the first sequence.

A sequence may be empty, so zero is a legitimate value for the sequence length.

Together, this field and the next three comprise a single entry in the sequence index table.

SOFF+5 xx IXFILE0: index into the file table for the file containing the first sequence.
SOFF+6 xx xx xx xx xx xx IXOFF0: offset (from the start of the appropriate sequence file) pointing to the first sequence.
SOFF+12 xx xx  IXNAME0: name of the first sequence, stored as a length byte IXNAMELEN0 followed by IXNAMELEN0 bytes of ASCII text.
Sequence index entries for the remaining SLEN-1 sequences.

Hash Function

The code for the underlying hash function is shown below, written in C. The hash bucket for sequence name NAME is computed by

    bucket = hassock_hash(NAME,strlen(NAME)) % HLEN;

This hash function is a variant of Austin Appleby's MurmurHash2. The primary differences are that it has the seed hardwired and scans the input data in the reverse order (this is not structly true, since the non-multiple-of-four leftover bytes are handled slightly differently). It is also endian-neutral.

    #include <stdint.h>

    uint32_t hassock_hash (const void* key, uint32_t len)
        {
        const uint32_t seed = 0x5C3FC4D3;
        const uint32_t m    = 0x87C10417;
        const uint8_t* data = ((const uint8_t*) key) + len;
        const uint8_t* stop = ((const uint8_t*) key) + 4;
        uint32_t       h, k;

        h = seed ^ len;
        while (data >= stop)
            {
            k  = *(--data);
            k |= *(--data) << 8;
            k |= *(--data) << 16;
            k |= *(--data) << 24;
            k *= m;
            k ^= k >> 24;
            k *= m;
            h *= m;
            h ^= k;
            len -= 4;
            }
        switch (len)
            {
            case 3: h ^= *(--data) << 16;
            case 2: h ^= *(--data) << 8;
            case 1: h ^= *(--data);
                    h *= m;
            }
        h ^= h >> 13;
        h *= m;
        h ^= h >> 15;
        return h;
        }

Example

In this example, we have 10 sequences from 3 fasta files, indexed by a single HSX file. We first show the fasta files, then show a field-by-field hex dump of the corresponding HSX file. For demonstration purposes, the HSX file was created with only 5 buckets. Typical HSX files will deal with more sequences, more files, and have more buckets.

hsxexA.fa contains five sequences:

    >HSXEXA_785
    TAACGGCAATCTTTGGTAGACCTATTGGTCATATCATGAAATTGAAGGAT
    AATTATTGCCATAAAGTTTTTCACGTTACTATCTTTGCCTCGCAATGAAT
    AAAATATTCTTAGGGCTACTTTGTAACCTTGCAGAC
    >HSXEXA_88K
    TTAATTACTCGCATGATCTTTCAAGATCTTTACCGTTCACACAATTTCTC
    GAACACTCAGTA
    >HSXEXA_DNQ
    CAGTGTACAAAATAAACTATTAACTATATGTAGATAGATACATAGAGACA
    AAACGGGTAGCATCTAGTATCCTGACTGCGCATTGTGGGGTGTCGCTTCT
    AAGTACCCGAAATGAGCGT
    >HSXEXA_LRW
    TTAAGTACATTCAGATCCATCATGGTTTCGGAAGCTAATGGGAAAAGGGG
    TACAGAATACAACACCTAGTTGATACGATAGTTAGTTTTTTA
    >HSXEXA_R9V
    TATAGTGCGTGTATGACCAATATTACGATGATCGTGACGCCATAGGGTCA
    TATTCCTTAATATGTAAATATGAAGGTA

hsxexB.fa contains four sequences:

    >HSXEXB_6YF
    AAGAGTTCTTACGGCAATAACAAAATGATGCTGTATCCTAGTAACAGGAA
    CGAACCATTCGCTTCTGTGTTCTATACAGAAGAAACCAGACTCGCTAAAC
    A
    >HSXEXB_WCV
    AATTAGTCTATTAAGGACTATATGTTTACAAGGATGGTAGTCCTAACGGA
    ATTGATACCAATAGGTGGCACTTACCGTAGCTAGGTAGATCGCCCTACTA
    CACCAGCTCAGCCATCTTGCCCCGCCAACT
    >HSXEXB_YKU
    GTCAACAGGTTTTCGGACTGGTGGCTTTCCTGATTTGATATTCAAAGGAA
    ATTAGGGTAAGGACTTTGAGTTGTCATAGAATTCAATTTCGGGCTCCGTC
    CATCACCTCGT
    >HSXEXB_YV1
    GGTGATGTTGTGAATATCACTGTCATGAAGGTCTCCTTCGGCCGCCTTAA
    TCATCATCATAAGTTTCACCATGGTAAAATGAATTAGCCCCAAGCT

hsxexC.fa contains three sequences:

    >HSXEXC_4ZL
    CACATCACATTGGTTGTTCATCCATATAATTATTTCCCATAAACTTTAAG
    AGCTCGGCTGGCCATACGTACTGACTAGCTTAGCCCCTAACTAATCGGCC
    ACAGCGATAGTACA
    >HSXEXC_936
    TGGTTTTTAGAGTCCGTGGAGCCTCTCAGCCACACTGGGTTCGGGAAGTT
    TCAGGCAAGTCCTACCTGTAA
    >HSXEXC_GWD
    CTAATCTGGGCTTGGGTCTGAACTCGCCCATGAGGAGGTAAGCAAACCAA
    TAAATTCGGGTATGGCGGTCTTTATTATGCTTAAGGAACGGAACAA

The twelve sequence names are hashed into separate buckets, and sorted within buckets, like this:

Hash Code/Bucket Sequence Name Fasta file Offset to sequence, in fasta file
0 HSXEXB_6YF hsxexB.fa 0x0000 00000000
1 HSXEXA_785
HSXEXA_DNQ
hsxexA.fa
hsxexA.fa
0x0000 00000000
0x0000 000000E3
2 HSXEXA_88K
HSXEXA_LRW
HSXEXB_YV1
HSXEXC_4ZL
hsxexA.fa
hsxexA.fa
hsxexB.fa
hsxexC.fa
0x0000 00000097
0x0000 00000169
0x0000 00000183
0x0000 00000000
3 HSXEXB_YKU hsxexA.fa 0x0000 00000105
4 HSXEXA_R9V
HSXEXB_WCV
HSXEXC_936
HSXEXC_GWD
hsxexA.fa
hsxexB.fa
hsxexC.fa
hsxexC.fa
0x0000 000001D3
0x0000 00000074
0x0000 00000081
0x0000 000000D6

Here is the complete HSX file, byte-by-byte:

File OffsetDataFieldMeaning
0x00000000 D2 52 70 95 Magic number Big-endian.
0x00000004 00 00 01 00 HSX version 1.0.
0x00000008 00 00 00 1C Header length 28 bytes.
0x0000000C 00 00 00 03 FLEN=3 3 entries in file table.
0x00000010 00 00 00 30 FOFF=30 File table is at 0x00000030.
0x00000014 00 00 00 05 HLEN=5 5 buckets in the hash table.
0x00000018 00 00 00 60 HOFF=60 Hash table is at 0x00000060.
0x0000001C 00 00 00 0C SLEN=0C 12 entries in the sequence index table.
0x00000020 00 00 00 60 SOFF=80 Sequence index table is at 0x00000080.
0x00000024 00 00 00 00
00 00 00 00
00 00 00 00
Padding The creating program can insert padding here, at its descretion.
0x00000030 00 00 00 40 FINFO0=40 Info record for file 0 is at 0x00000040.
0x00000034 00 00 00 4A FINFO1=4A Info record for file 1 is at 0x0000004A.
0x00000038 00 00 00 54 FINFO2=54 Info record for file 2 is at 0x00000054.
0x0000003C 00 00 00 00 Padding
0x00000040 00 66 61 FTYPE0 File type for file 0 is "fa".
0x00000043 06 68 73 78
65 78 41
FNAME0 Base name for file 0 is "hsxexA". File name is hsxexA.fa.
0x0000004A 00 66 61 FTYPE1 File type for file 1 is "fa".
0x0000004D 06 68 73 78
65 78 42
FNAME1 Base name for file 1 is "hsxexB". File name is hsxexB.fa.
0x00000054 00 66 61 FTYPE2 File type for file 2 is "fa".
0x00000057 06 68 73 78
65 78 43
FNAME2 Base name for file 2 is "hsxexC". File name is hsxexC.fa.
0x0000005E 00 00 Padding
0x00000060 00 00 00 00 80 SOFFH(0)=80 Sequence entries for hash bucket 0 start at 0x00000080.
0x00000060 00 00 00 00 97 SOFFH(1)=97 Sequence entries for hash bucket 1 start at 0x00000097.
0x00000060 00 00 00 00 C5 SOFFH(2)=C5 Sequence entries for hash bucket 2 start at 0x000000C5.
0x00000060 00 00 00 01 21 SOFFH(3)=121 Sequence entries for hash bucket 3 start at 0x00000121.
0x00000060 00 00 00 01 38 SOFFH(4)=138 Sequence entries for hash bucket 4 start at 0x00000138.
0x00000060 80 00 00 01 94 SOFFH(5)=194 Sentinel bucket indicates end of sequence entries is at 0x00000194.

The most significant bit of SOFFH(5) is a 1, indicating the bucket is empty.

0x0000007E 00 00 Padding
0x00000080 00 00 00 00 65 IXLEN0=65 (Start of hash bucket 0)
Sequence 0 is 101 bp.
0x00000085 01 IXFILE0=1 Sequence 0 is in file 1 (hsxexB.fa).
0x00000086 00 00 00 00 00 00 IXOFF0=00 Sequence 0 is at file offset 0x0000 00000000.
0x0000008C 0A 48 53 58
45 58 42 5F
36 59 46
IXNAME0 Sequence 0 is named "HSXEXB_6YF".
00000097 00 00 00 00 88 IXLEN1=88 (Start of hash bucket 1)
Sequence 1 is 136 bp.
0000009C 00 IXFILE1=0 Sequence 1 is in file 0 (hsxexA.fa).
0000009D 00 00 00 00 00 00 IXOFF1=00 Sequence 1 is at file offset 0x0000 00000000.
000000A3 0A 48 53 58
45 58 41 5F
37 38 35
IXNAME1 Sequence 1 is named "HSXEXA_785".
000000AE 00 00 00 00 77 IXLEN2=77 Sequence 2 is 119 bp.
000000B3 00 IXFILE2=0 Sequence 2 is in file 0 (hsxexA.fa).
000000B4 00 00 00 00 00 E3 IXOFF2=E3 Sequence 2 is at file offset 0x0000 000000E3.
000000BA 0A 48 53 58
45 58 41 5F
44 4E 51
IXNAME2 Sequence 2 is named "HSXEXA_DNQ".
000000C5 00 00 00 00 3E IXLEN3=3E (Start of hash bucket 2)
Sequence 3 is 62 bp.
000000CA 00 IXFILE3=0 Sequence 3 is in file 0 (hsxexA.fa).
000000CB 00 00 00 00 00 97 IXOFF3=97 Sequence 3 is at file offset 0x0000 00000097.
000000D1 0A 48 53 58
45 58 41 5F
38 38 4B
IXNAME3 Sequence 3 is named "HSXEXA_88K".
000000DC 00 00 00 00 5C IXLEN4=5C Sequence 4 is 92 bp.
000000E1 00 IXFILE4=0 Sequence 4 is in file 0 (hsxexA.fa).
000000E2 00 00 00 00 01 69 IXOFF4=169 Sequence 4 is at file offset 0x0000 00000169.
000000E8 0A 48 53 58
45 58 41 5F
4C 52 57
IXNAME4 Sequence 4 is named "HSXEXA_LRW".
000000F3 00 00 00 00 60 IXLEN5=60 Sequence 5 is 96 bp.
000000F8 01 IXFILE5=1 Sequence 5 is in file 1 (hsxexB.fa).
000000F9 00 00 00 00 01 83 IXOFF5=183 Sequence 5 is at file offset 0x0000 00000183.
000000FF 0A 48 53 58
45 58 42 5F
59 56 31
IXNAME5 Sequence 5 is named "HSXEXB_YV1".
0000010A 00 00 00 00 72 IXLEN6=72 Sequence 6 is 130 bp.
0000010F 02 IXFILE6=2 Sequence 6 is in file 2 (hsxexC.fa).
00000110 00 00 00 00 00 00 IXOFF6=0 Sequence 6 is at file offset 0x0000 00000000.
00000116 0A 48 53 58
45 58 43 5F
34 5A 4C
IXNAME6 Sequence 6 is named "HSXEXC_4ZL".
00000121 00 00 00 00 6F IXLEN7=6F (Start of hash bucket 3)
Sequence 7 is 111 bp.
00000126 01 IXFILE7=1 Sequence 7 is in file 1 (hsxexB.fa).
00000127 00 00 00 00 01 05 IXOFF7=105 Sequence 7 is at file offset 0x0000 00000105.
0000012D 0A 48 53 58
45 58 42 5F
59 4B 55
IXNAME7 Sequence 7 is named "HSXEXB_YKU".
00000138 00 00 00 00 4E IXLEN8=4E (Start of hash bucket 4)
Sequence 8 is 78 bp.
0000013D 00 IXFILE8=0 Sequence 8 is in file 0 (hsxexA.fa).
0000013E 00 00 00 00 01 D3 IXOFF8=1D3 Sequence 8 is at file offset 0x0000 000001D3.
00000144 0A 48 53 58
45 58 41 5F
52 39 56
IXNAME8 Sequence 8 is named "HSXEXA_R9V".
0000014F 00 00 00 00 82 IXLEN9=82 Sequence 9 is 130 bp.
00000154 01 IXFILE9=1 Sequence 9 is in file 1 (hsxexB.fa).
00000155 00 00 00 00 00 74 IXOFF9=74 Sequence 9 is at file offset 0x0000 00000074.
0000015B 0A 48 53 58
45 58 42 5F
57 43 56
IXNAME9 Sequence 9 is named "HSXEXB_WCV".
00000166 00 00 00 00 47 IXLEN10=47 Sequence 10 is 71 bp.
0000016B 02 IXFILE10=2 Sequence 10 is in file 2 (hsxexC.fa).
0000016C 00 00 00 00 00 81 IXOFF10=81 Sequence 10 is at file offset 0x0000 00000081.
00000172 0A 48 53 58
45 58 43 5F
39 33 36
IXNAME10 Sequence 10 is named "HSXEXC_936".
0000017D 00 00 00 00 60 IXLEN11=60 Sequence 11 is 96 bp.
00000182 02 IXFILE11=2 Sequence 11 is in file 2 (hsxexC.fa).
00000183 00 00 00 00 00 D6 IXOFF11=D6 Sequence 11 is at file offset 0x0000 000000D6.
00000189 0A 48 53 58
45 58 43 5F
47 57 44
IXNAME11 Sequence 11 is named "HSXEXC_GWD".
00000194 (file ends here)


Bob Harris and Cathy Riemer, January 2010