Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BoyerMoore Updated with example #24

Open
agento2 opened this issue Nov 5, 2016 · 5 comments
Open

BoyerMoore Updated with example #24

agento2 opened this issue Nov 5, 2016 · 5 comments

Comments

@agento2
Copy link

agento2 commented Nov 5, 2016

In testing pattern searches with BoyerMoore i can't get it to return any matches until the search string starts with a *. This also results in position 0 always being the match location. I'm going to try and do further testing but ever the simple example from the cookbook doesn't behave as i would expect.

@evolvedmicrobe
Copy link
Member

Eck, that's no good. I haven't used that aspect of the library, will see if I can recreate your example

@evolvedmicrobe
Copy link
Member

I didn't dig into the algorithm, but could verify that I didn't need the search string to start with a * or that it always matches at the start. See these examples:

        ISequence sequence = new Sequence (Alphabets.AmbiguousDNA, "GCTAGGTAGCTCAAAAAAGGG");
        IPatternFinder searcher = new BoyerMoore () {
            IgnoreCase = true
        };

        foreach (var pos in searcher.FindMatch (sequence, "GCTA")) {
            Console.WriteLine ("Found Match at: {0}", pos); //0 correct 
        }

        foreach (var pos in searcher.FindMatch (sequence, "GCTCA")) {
            Console.WriteLine ("Found Match at: {0}", pos); // 8 correct
        }

@evolvedmicrobe
Copy link
Member

Okay, I can't reproduce this problem after trying a few times so will close this. Feel free to reopen if you can upload an example though.

The wildcard indicates not just one character but 0-infinity of them, so if your search string is within the sequence, you will always have a 0 returned if you lead off with a *, as everything from the start of the sequence on is part of the pattern match.

@agento2
Copy link
Author

agento2 commented Nov 7, 2016

I can't reopen, but I did get your example to work but then poking at it since my code still wasn't working i found this

ISequence sequence = new Sequence(Alphabets.AmbiguousDNA, "gcTAGGTAGCTCAAAAAAGGG");
            searcher = new BoyerMoore()
            {
                IgnoreCase = true
            };

            foreach (var pos in searcher.FindMatch(sequence, "GCTA"))
            {
                Console.WriteLine("Found Match at: {0}", pos); //0 correct // Now no match
            }

            foreach (var pos in searcher.FindMatch(sequence, "gCTCA*GGG"))
            {
                Console.WriteLine("Found Match at: {0}", pos); // 8 correct
            }

appears there's some issue with the sequence constructor having lower case letters in it. It is really kind of weird the first one stops matching but the second one is fine. If I make the sequence with a toLower on your string neither matches anymore.
And if I toUpper mine it matches. So is this just something that needs documented or is there actually an issue.

@agento2
Copy link
Author

agento2 commented Nov 7, 2016

here's an example where it fails. I generated 2500bp of random then a couple short bits and inserted them and it will find the individuals but the two combined with a wild card in the middle fails.

namespace matchTest
{
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using Bio;
    using Bio.Algorithms.StringSearch;
    using Bio.Extensions;

    class Program
    {
        static void Main(string[] args)
        {
            var sample = new Sequence(Alphabets.DNA, Seq.ToUpper());
            IPatternFinder searcher = new BoyerMoore()
            {
                IgnoreCase = true
            };
            var boyerMatches = searcher.FindMatch(sample, search1);
            foreach (var pos in boyerMatches)
            {
                Console.WriteLine("Found Match at: {0}", pos); // 11
            }
            Console.WriteLine("end 1");

            boyerMatches = searcher.FindMatch(sample, search2);
            foreach (var pos in boyerMatches)
            {
                Console.WriteLine("Found Match at: {0}", pos); // 36
            }
            Console.WriteLine("end 2");

            boyerMatches = searcher.FindMatch(sample, string.Format("{0}*{1}", search1, search2)); 
            foreach (var pos in boyerMatches)
            {
                Console.WriteLine("Found Match at: {0}", pos); //Expect 11 get nothing
            }
            Console.WriteLine("end combo");

            Console.ReadLine();
        }

        private static string search1 = "ATGTCCCACGTGAAACGTTGCT";
        private static string search2 = "AAACCCTCAGGTTTCTGAGCGACA";

        private static string Seq =
            "AAGCTTTAAACATGTCCCACGTGAAACGTTGCTGGGAAACCCTCAGGTTTCTGAGCGACAAGTTCGCGCTCATAACTTGGTCCGAATGCGGGTTCTTGCATCGTTCGACTGAGTTTGTTTCATGTAGAACGGGCGCAAAGTATACTTAGTTCAATCTTCAATACCTCGTATCATTGTACACCTGCCGGTCACCACCCAACGATGTGGGGACGGCGTTGCAACTTCGAGGACCTAATCTGACCGACCTAGATTCGGCACTGTGGGCAATATGAGGTATTGGCAGACACCCAGTGCCGAACAACACCTGACCTAACGGTAAGAGAGTCTCATAATGCGTCCGGCCGCGTGCCCAGGGTATATTTGGACAGTATCGAATGGACTGAGATGAACCTTTACACCGATCCGGAAACGGGTGCGTGGATTAGCCAGGAGCAAACGAAAAATCCTGGGCTACTTGATGTCTTGTGACGTTCTTAGAGATGGACGAAATGTTTCACGACCTAGGATAAGGTCGCCCTACAAAATAGATTTGTGCTACTCTCCTCATAAGCAGTCCGGTGTATCGAAAGAACAAGACTAGCCTTGCTAGCAACCGCGGGCTGGGGGGCTAAGGTATCACTCAAGAAGCAGGCTCGGTAACATACGGTCTAGCTATCTGACTATCGCCTACGTCATATAGGGACCTATGATATCTGCGTGTCCAACCTTAGAATTCACTTCAGCGCGCAGGTTTGGGTCGAGATAAAATCACCAGTACCCAAGACCACGGGCGCTCGGCGCCTTGGCTAATCCCGGTACATGTTGTTATAAATAATCAGTAGAAAATCTGTGTTAGAGGGTCGAGTCACCATATATCAAGAACGATATTAATCGGTGGGAGTATTCAACGTGATGAAGACGCTGGGTTTACGTGGGAATGGTGCTTCTGTCCTAACAGGCTAGGGTATAATGCTGAAACCGTCCCCCAAGCGTTCAGGGTGGGGTTTGCTACGACTTCCGAGTCCAAAGTGTCCGTGTTTTTGATATATACGCTCAAGGGCGAGAATTGGACCTGGCTTACGTCTTAGTACGTAGCATGGTGACACAAGCACAGTAGATCCTGCCCGCGTTTCCTATATATTAAGTTAAATCTTATGGAATATAATAACATGTGGATGGCCAGTGGTCGGTTGTTACACGCCTACCGCAATGCTGAAAGACCCGGACTAGAGTGGCGAGATCTATGGCGTGTGACCCGTTATGCTCCATTTCGGTCAGTGGGTCACAGCTAGTTGTGGATTGGATTGCCATTCTCCGAGTGTTTTAGCGTGACAGCCGCAGGGATCCCATAAAATGCAATCGTAGTCCACCTGATCGTACTTAGAAATGAGGGTCCGCTTTTGCCCACGCACCTGATCGCTCCTCGTTTGCTTTTAAGAACCGGACGAACCACAGAGCATAAGGAGAACCTCTAGCTGCTTTACAAAGTACTGGTTCCCTTTCCAGCGGGATGCTTTATCTAAACGCAATGAGAGAGGTATTCCTCAGGCCACATCGCTTCCTAGTTCCGCTGGGATCCATCGTTGGCGGCCGAAGCCGCCATTCCATAGTGAGTTCTTCGTCTGTGTCATTCTGTGCCAGATCGTCTGGCAAATAGCCGATCCAGTTTATCTCTCGAAACTATAGTCGTACAGATCGAAATCTTAAGTCAAATCACGCGACTAGACTCAGCTCTATTTTAGTGGTCATGGGTTTTGGTCCCCCCGAGCGGTGCAACCGATTAGGACCATGTAGAACATTAGTTATAAGTCTTCTTTTAAACACAATCTTCCTGCTCAGTGGTACATGGTTATCGTTATTGCTAGCCAGCCTGATAAGTAACACCACCACTGCGACCCTAATGCGCCCTTTCCACGAACACAGGGCTGTCCGATCCTATATTACGACTCCGGGAAGGGGTTCGCAAGTCGCACCCTAAACGATGTTGAAGGCTCAGGATGTACACGCACTAGTACAATACATACGTGTTCCGGCTCTTATCCTGCATCGGAAGCTCAATCATGCATCGCACCAGCGTGTTCGTGTCATCTAGGAGGGGCGCGTAGGATAAATAATTCAATTAAGATATCGTTATGCTAGTATACGCCTACCCGTCACCGGCCAACAGTGTGCAGATGGCGCCACGAGTTACTGGCCCTGATTTCTCCGCTTCTAATACCGCACACTGGGCAATACGAGCTCAAGCCAGTCTCGCAGTAACGCTCATCAGCTAACGAAAGAGTTAGAGGCTCGCTAAATCGCACTGTCGGGGTCCCTTGGGTATTTTACACTAGCGTCAGGTAGGCTAGCATGTGTCTTTCCTTCCAGGGGTATGCGGGTGCGTGGACAAATGAGCAGCATACGTATTTACTCGGCGTGCTTGGTCTCTCGTATTTCTCCTGGAGATCAAGGAAATGTTTCATGTCCAAGCGAAAAGCCGCTCTACGGAATGGATCTACGTTACTGCCTGCATAAGGAGACCGGTGTAGCCAAGGACGAAAGCGACCCTAGGTTCTAACCGTCGACTT";
    }
}

@agento2 agento2 changed the title BoyerMoore BoyerMoore Updated with example Nov 7, 2016
@evolvedmicrobe evolvedmicrobe reopened this Nov 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants