I found an “easy” problem from [LeetCode] problem set. Problem is easy to solve, but there’s one solution from discussion forum that may provide some inspiration on other problems. The problem description goes like following.
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
The solution below is quite tricky(at least for me).
For DNA, there are only 4 symbols:
T, of which the ASCII code is
0x54. They differ on the last number, which is at most 7, so they can be represented by 3 bits.
Hence a 4-byte integer can represent this DNA sequeence of length 10, and the 2 bits on the head are left empty. For example, we can use
00|001|001|001|001|001|011|011|011|011|011 to represent
"AAAAACCCCC". Each time reading a new letter, the integer is moved to left by 3 bits and do ‘or’ with the letter’s last 3 bits. Thus one integer is a kind of unique letter sequence. Then a
map is used to record how many times each sequence appears.
To generalize the solution, say, if we want to find 3-letter-long subsequences which occur more than once, we need to modify the
if statement to