An Excellent Solution to Repeated DNA Sequences

I found an “easy” problem from [LeetCode] problem set. Problem is easy to solve, but there’s one solution from discussion forum that may provide some inspiration on other problems. The problem description goes like following.

Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

1
2
3
4
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Solution

The solution below is quite tricky(at least for me).

1
2
3
4
5
6
7
8
9
10
11
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<int, int> m;
vector<string> r;
for (int t = 0, i = 0; i < s.size(); i++)
if (m[t = t << 3 & 0x3FFFFFFF | s[i] & 7]++ == 1)
r.push_back(s.substr(i - 9, 10));
return r;
}
};

For DNA, there are only 4 symbols: A, C, G, T, of which the ASCII code is 0x41, 0x43, 0x47, 0x54. They differ on the last number, which is at most 7, so they can be represented by 3 bits.

  • 001——A
  • 011——C
  • 111——G
  • 100——T

Hence a 4-byte integer can represent this DNA sequeence of length 10, and the 2 bits on the head are left empty. For example, we can use 00|001|001|001|001|001|011|011|011|011|011 to represent "AAAAACCCCC". Each time reading a new letter, the integer is moved to left by 3 bits and do ‘or’ with the letter’s last 3 bits. Thus one integer is a kind of unique letter sequence. Then a map is used to record how many times each sequence appears.

To generalize the solution, say, if we want to find 3-letter-long subsequences which occur more than once, we need to modify the if statement to

1
2
if (m[t = t << 10 & 0x3FFFFFFF | s[i] & 7]++ == 1)
r.push_back(s.substr(i - 2, 3));

A Command-Line Translator


I learned how to call an API provided by some online open platform several days ago. Then I implemented a simple translator that runs in command line. Since I’m still new to English, I wrote this tool to help me understand English document when coding. Of course, it is not a perfect version because the translator could only run in directory where the code locates. Anyway, it a practice to a new world.

Because this is a English-to-Chinese translator and I’m not sure if Google API is accessible in China, I decided to call Baidu API to finish my work. The very first step is to login Baidu developer console to create an application, then a new API Key would be generated for you, which is the most important thing when we send request. Now we’re ready to write our code. According to this document(it’s a Chinese website), the request URL is http://openapi.baidu.com/public/2.0/translate/dict/simple, and the request body is shown in the table below

This Winter Break

I should write something in this winter break before it passes.

0 This break lasts for three weeks, from Dec/20/2014 to Jan/11/2015.

1 I planned something for this break.

  • Reading a book Computer Systems: A Programmer’s Perspective and finished one and a half chapter.
  • Taking a class Machine Learning Techniques and just watched the first-week videos.
  • Writing a bittorrent client which was a course project in past semester, but a simplified version. I want to implement all basic functions for an entire bittorrent client in practice. Now I just finished torrent parsing and some functions for interacting with a tracker server.
  • Practicing coding for interviews in next semester, but haven’t get it started.

2 My break should be enriched by my plans, but I cannot foretell future when I planned these.

3 There’s a health problem on me since the start of the second week.

4 Today things are much better(the start of the third week). To prove my body is still working. I finished a game Trine. Anyway, I finished something.

I was down, so almost everything is down.

Service Verification in Port Scanner


Recently I’m writting a port scanner and I need to verify if some standard services are running on remote hosts as expected. The verification method is quite simple(but took me a long time), that is, using connect() to that port, then analyze the returned messages. All messages will be returned by remote host only when the port being scanned is open, otherwise tag it as Unable to be connected.

  • HTTP
    Send string "GET / HTTP\r\n\r\n" to port 80 of an ip address. The remote host will send back message like
1
2
3
4
5
6
HTTP/1.0 400 Bad Request
Content-Type: text/html; charset=UTF-8
Content-Length: 1419
Date: Tue, 02 Dec 2014 05:56:25 GMT
Server: GFE/2.0
..

Then parse the first line we can obtain the version of HTTP running on that machine is 1.0.

A Chrome Extension for My Blog

I just developed a chrome extension for my blog this morning. The extension is used for opening my blog in a new tab with one click.

Open Panda Home

Actually I’m not familiar with javascript. The implementation refered to some code on stackoverflow and the icon pic is borrowed from this website. Although I’m not a professional developer for chrome, I paid 5 dollars and deployed this extension onto chrome app store. I have to admit it is simple and even a little ugly, but I wish you could like it.

Install Valgrind on Mac

Life is hard.

For whom may only concern the final solution, please check the last part—The Whole Procedure.

I got stuck in Exercise 4: Introducing Valgrind when trying to install valgrind on my computer. The tarball downloaded from the official site claims working fine on Mac OS X 10.9, but it doesn’t work for me when I tried to install it. I spent four hours on refering to a lot of similar problems online and luckily, I make valgrind run on my MacBook.

At first, I run the commands provided by the book.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 1) Download it (use wget if you don't have curl)
curl -O http://valgrind.org/downloads/valgrind-*.*.*.tar.bz2
# use md5sum to make sure it matches the one on the site
md5sum valgrind-*.*.*.tar.bz2
# 2) Unpack it.
tar -xjvf valgrind-*.*.*.tar.bz2
# cd into the newly created directory
cd valgrind-*.*.*
# 3) configure it
./configure
# 4) make it
make
# 5) install it (need root)
sudo make install

Relearn C Language

This semester I’m taking a course called computer networks. Learning and understanding how networks operate in theory and practice is helpful to a computer science student, and the most fantastic part of the course, I think, is the homework, which requires us using C or C++ to implement some network applications. Although I learned C language when I was a freshman, long time past and according to the fact that I had no computer experience at that very start, my fundament of C usage is not good enough to code those projects perfectly. Usually I could implement the functions needed in homework document, but a lot of problems in my code are also exposed by testing, such as memory leak. Additionally, when doing my homework, I always have a hard time to debug segmentation fault, and this fault is a kind of common phenomenon in my homework process. Hence I think it’s time to relearn C language, especially pointer and memory management, where I didn’t care about previously.

Recently, I found an online book Learn C The Hard Way. It not only contains silly programs like “hello world”, but also how to debug in Linux machine, data structures, and so on. Finally, as a review, only reading is not enough. At least I should write something when relearning it.

Test Post

This post is used to test $\LaTeX$ features and coding supports.
A formula: $E=mc^2$

1
2
3
4
5
6
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello world\n");
return 0;
}

This is one of my favorite pictures.