Tokenization Algorithm

The goal of this algorithm was to produce a catalogue of vocalizations from the vector sets produced by the Fourier Transform based Analysis. The process transformed these vector sets into strings of characters. Each frequency section was assigned a character value, and for each interval the section with the greatest amplitude was identified. The characters produced from each interval were then strung together to produce a character string that could be searched using existing tools.
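
The tokenization step described above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes each interval is already available as a list of amplitudes with one value per frequency section, and the function name `tokenize`, the lowercase alphabet, and the sample amplitudes are all hypothetical.

```python
import string


def tokenize(intervals, alphabet=string.ascii_lowercase):
    """Map each interval to the character of its loudest frequency section.

    `intervals` is assumed to be a sequence of amplitude lists, one list per
    time interval and one amplitude per frequency section (an assumption made
    for this sketch; the original data layout is not specified).
    """
    tokens = []
    for amplitudes in intervals:
        if len(amplitudes) > len(alphabet):
            raise ValueError("more frequency sections than available characters")
        # Index of the section with the greatest amplitude (first one wins ties).
        loudest = max(range(len(amplitudes)), key=lambda i: amplitudes[i])
        tokens.append(alphabet[loudest])
    return "".join(tokens)


# Three intervals with four frequency sections each (made-up amplitudes).
example = [
    [0.1, 0.9, 0.2, 0.0],
    [0.3, 0.1, 0.8, 0.2],
    [0.7, 0.2, 0.1, 0.1],
]
print(tokenize(example))  # bca
```

With ten or twenty-five sections, a single lowercase alphabet is enough to give every section its own character.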

The first search algorithm tested was a simple string search. It scanned through the catalogue of token strings and reported the name of each entry that contained the supplied search string. This search was tested with a fixed interval of one second, no overlap, and the number of sections set to either ten or twenty-five. The test set consisted of random substrings taken from each recording used to build the catalogue, and as such was heavily biased towards success. Despite this, the best results obtained were a 20% success rate in classification. This was likely due to the lack of synchronization between the segments of vocalization used to create the catalogue and those used to test it, as illustrated in the following figure. This figure shows the effect of offsetting the starting time of a sample by a fraction of an interval. Because of this effect, the same vocalization can produce different character strings depending on where sampling begins, so the character representation of a signal is not unique to that vocalization.
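
A minimal sketch of the simple string search is shown here. It assumes the catalogue is held as a dictionary mapping an entry name to its token string; that layout, the function name `simple_search`, and the token strings are hypothetical and chosen only for illustration.

```python
def simple_search(catalogue, query):
    """Return the names of all catalogue entries whose token string contains `query`."""
    return [name for name, tokens in catalogue.items() if query in tokens]


# Hypothetical catalogue: entry name -> token string.
catalogue = {
    "house_finch": "abbcabbd",
    "purple_finch": "ccadccab",
}

print(simple_search(catalogue, "bbd"))  # ['house_finch']
print(simple_search(catalogue, "ca"))   # ['house_finch', 'purple_finch']
print(simple_search(catalogue, "bdb"))  # []
```

The search only succeeds on an exact substring match, so a query whose tokens differ in even one position returns nothing, as the last call shows.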

To resolve the issues discovered in the first search, a new search algorithm was developed. This search used not only the supplied search string, but also every substring longer than one character within that string. Whenever a match was found in a catalogue entry, the length of the matched string was added to the total for that entry. When the search was complete, the algorithm returned a list of weights, one for each entry in the catalogue. The entry with the largest weight was considered the best match, and the search string was classified as belonging to the corresponding bird species. This new search was tested with a range of data sets created using each combination of the following variables: overlap set to 0%, 25%, 50% or 85%; interval set to 1, 0.5 or 0.25 seconds; and frequency sections set to ten or twenty-five. The results of a test run to classify House and Purple Finches are summarized in the following tables.

Fraction of Successful Token Searches using Ten Sections
Overlap (%) \ Interval (s)    1         0.5       0.25
0                             0.5000    0.5000    0.5000
25                            0.5000    0.5000    0.5000
50                            0.5000    0.5000    0.5000
85                            0.5000    0.5000    0.5526
 
Fraction of Successful Token Searches using Twenty-Five Sections
Overlap (%) \ Interval (s)    1         0.5       0.25
0                             0.5625    0.6000    0.5400
25                            0.6250    0.5400    0.5000
50                            0.5833    0.5200    0.5400
85                            0.5208    0.5000    0.4474
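
The weighted substring search that produced these results can be sketched as follows. This is an illustration under stated assumptions rather than the original code: the catalogue is assumed to be a dictionary of entry name to token string, each substring of the query contributes its length at most once per entry in which it appears (the original does not say whether repeated occurrences count more than once), and the names `weighted_search` and `classify` are hypothetical.

```python
def weighted_search(catalogue, query):
    """Score each catalogue entry by the query substrings it contains.

    Every substring of `query` that is at least two characters long is
    checked against every entry; when it is found, its length is added to
    that entry's weight (assumed here to be counted once per substring).
    """
    weights = {name: 0 for name in catalogue}
    n = len(query)
    for start in range(n - 1):
        for end in range(start + 2, n + 1):
            sub = query[start:end]
            for name, tokens in catalogue.items():
                if sub in tokens:
                    weights[name] += len(sub)
    return weights


def classify(catalogue, query):
    """Return the entry name with the largest weight (first entry wins ties)."""
    weights = weighted_search(catalogue, query)
    return max(weights, key=weights.get)


# Hypothetical catalogue: entry name -> token string.
catalogue = {
    "house_finch": "abbcabbd",
    "purple_finch": "ccadccab",
}

print(weighted_search(catalogue, "bcab"))  # {'house_finch': 16, 'purple_finch': 7}
print(classify(catalogue, "bcab"))         # house_finch
```

Because shorter substrings still earn partial credit, a query can score against an entry even when the full string never appears in it, which is what lets this search tolerate some misalignment between catalogue and test segments.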

The best result obtained was a 62.5% rate of successful classification. With only two possible classifications, guessing at random would be expected to succeed roughly half the time, so these results are very poor. As a result of these tests, this approach to identification was abandoned.