Product Management: It’s all about communication

giphy

For the larger part of my career as an engineer I have been an individual contributor. Straight out of engineering, I got a team of my own about 18 months ago. Looking back now, this GIF explains exactly how I feel about that period. Not that I was dumb, but I was so involved in what I was doing that I usually forgot to look at the zoomed-out picture. But that’s not all; there is another side to this story as well.

In this short stint of working with the team I have learned a lot about the gap between how things are in my head and how they actually are. The most important thing I learnt is that:

Developers can only be as good as their managers are.

What I mean is that a developer cannot build things better than what is communicated to them. The kid in the GIF did exactly what was communicated to him, and that’s what developers do as well. The worst part is that it is the coach who is wrong, and in my case the coach was me. If I do a shoddy job of communicating the problem, the work that gets delivered will be equally shoddy, through no fault of the programmer. It simply is not the programmer’s job to interpret what is being communicated; everything that has to be said has to be said in absolute terms.

It is (relatively) easy to keep everyone on the same page when a team is small. As the number of people grows, communication starts to become jittery and information loss happens. This law is valid for all kinds of communication, be it digital or person-to-person. I have an equation which tells how well informed a person is when they are n communicating points away from the origin of the communication:

Effectiveness ∝ (language skills of speaker × listener’s grasping skill) ^ n

Here, both the speaker’s language skills and the listener’s grasping skill are in the range 0 to 1, which means the effectiveness of the communication is always going to be lower than the understanding of the person who is trying to communicate, and it drops further with every additional communicating point. Hence, the person who understands best is the one who initiates the communication, and the first loss of information also happens at this point. The human brain stores information in a complex form. A narration in the head can consist of words, pictures, emotions/feelings and moments. Hence, when a person tries to translate that information into words, some of it is lost.
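
To get a feel for how quickly this decays, here is a minimal sketch of the equation in Python. It assumes a proportionality constant of 1 and the same speaker and listener skill at every hop; the function name and the sample values are only illustrative.

```python
def effectiveness(speaker_skill, listener_skill, n):
    """Relative effectiveness for a person n communicating points from the origin.

    Assumes a proportionality constant of 1 and identical skills at every hop.
    """
    if not (0 <= speaker_skill <= 1 and 0 <= listener_skill <= 1):
        raise ValueError("skills must be in the range 0 to 1")
    return (speaker_skill * listener_skill) ** n


# Even with fairly good skills on both sides, each extra hop loses information.
for hops in (1, 2, 3, 5):
    print(hops, round(effectiveness(0.9, 0.9, hops), 3))
```

With both skills at 0.9, a single hop already keeps only about 81% of the message, and it only goes down from there.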

Coming back to the point. If a developer in the team is producing things which are not what is expected, it is only because things have not been communicated well to them. In order to improve the effectiveness of the communication, it has to be done over and over. The more times it is repeated, the better the understanding of the situation in the listener’s head. Why? Because every time the communication happens, our brain automatically refines what is to be communicated and adds more to it: the tiny details that we otherwise forget or neglect. A good way to experiment with this idea is to play Pictionary.

This is important for both the team members and the team lead. Effort has to be made from both directions. I hope this post helps you become a better team player! 🙂

Cassandra: What it is and what not

Recently I had a chance to work with Cassandra. To explain the need in short, a distributed key-value store was required. Redis is great, but it doesn’t let you have multiple geographically distributed writable servers, whereas Cassandra does. Here are a few points about Cassandra that one can keep in the back of the head while setting it up.

  • Contrary to what I said, Cassandra is not exactly key-value storage. It is more of a JSON-like storage which can behave like a key-value store.
  • CQL (Cassandra Query Language) distracts the user from the true nature of Cassandra. It makes one believe that Cassandra is like an RDBMS, while it’s not. One must think of it as key-value, where values can be extended further.
  • Data structures provided by Redis can be implemented easily.
  • Throughput depends on the configuration of the machine. The more CPUs there are, the better the throughput can be. Go through the other configuration options as well; best practices are explained in the configuration file.
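  • Data is first written to memory (memtables) and later flushed to disk (SSTables), which compaction then merges; when this happens depends on the settings, such as how much of the Java heap memtables may use. The 10-day default is gc_grace_seconds, after which tombstones can be dropped during compaction.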
  • The fewer the tombstones, the faster compaction will be. This also lowers the probability of reading SSTables, which means a disk read.
  • Updates are a waste of resources; use inserts instead, which overwrite the existing row/document on compaction.
  • To keep reads fast, use a consistency level of ONE (a quorum of 1); that is, whatever you get on selection is treated as the truth.
  • The partitioning key should be designed so that the data related to it stays on a single node. IMO the partitioning key is analogous to a table in an RDBMS. This way, for a particular key, a read with a quorum of 1 will always return the truth. See composite partitioning keys.
  • Latency is a thing to worry about. Try to keep the number of reads as low as possible.
  • Schema structure is very important. Learn about the queries one needs to make on the table before writing the schema. Unlike in an RDBMS, conditions have to be on successive columns rather than on arbitrary columns.
  • If using PHP, use the Java client with a PHP-Java bridge instead of the native PDO driver. It provides almost 3x read/write throughput per node.
  • IRC is a good place to get help in case of issues.
  • If the nodes are EC2 instances, a snitch configured for EC2 is available for use.
  • Version 2.0.9 is not compatible with 2.1.x in a cluster.
  • In general, versions below x.x.5 are not production ready and have serious bugs. The current 2.1.1 is not suitable for a production environment.
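
To illustrate the point about partitioning keys, here is a toy sketch in Python of why rows that share a partition key end up together. It is a deliberate simplification: the node names are hypothetical, and it uses an MD5 hash modulo the node count, whereas Cassandra itself uses its own partitioner, a token ring and replication.

```python
import hashlib

# Hypothetical three-node cluster; real clusters use a token ring, not modulo.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key, nodes=NODES):
    """Map a partition key to one node by hashing it (toy model only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Rows are (partition key, clustering column, value); the key alone picks the node.
rows = [
    ("user:42", "2014-11-01", "login"),
    ("user:42", "2014-11-02", "purchase"),
    ("user:7", "2014-11-01", "login"),
]

for key, day, event in rows:
    print(key, day, event, "->", node_for(key))
```

Both rows for user:42 map to the same node, so a read for that key can be answered by one node without consulting the others.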

Natural Language Processing – Terminology

Artificial Intelligence has lots of applications, one of which is Natural Language Processing, or NLP for short. NLP is the study or processing of natural text to find interesting patterns or details in it for other purposes, like tagging a post automatically. Contrary to the general belief that it is very advanced computer science, NLP started in the early 1950s.

There are various applications of NLP, like machine translation (eg: Google Translate), Part of Speech tagging (eg: tagging whether a word is a noun, a verb etc.), automatic summarization (eg: Text Compactor), natural language generation (eg: Alice, a software that interacts with humans) etc. You can read more about it here.

In this post, I am explaining the terminology used in NLP.

  • Natural Text
    Natural text is, in general, any text which is human readable. All languages that humans write are natural languages and can be used for language processing.
    Eg: English, Hindi, French
  • Labeled Text
    Basically marked text, which can then be used by various machine learning or NLP algorithms to label or predict annotations for un-annotated, or simply unmarked, text.
    Eg: A movie review tagged with positive or negative sentiment.
  • Data Set
    The data on which we intend to perform the task of NLP is the data set.
    Eg: Data set of human genome.
  • Attributes
    The data set in general has certain attributes which we use while processing it. A column in a database table can be considered an attribute.
    Eg: “Wind speed” in weather data.
  • Classification
    When there is a list of pre-defined labels, the problem of assigning those labels to the data set is called classification.
    Eg: Classifying news articles into their categories, classifying a tumor as benign or malignant.
  • Regression
    When there is a data set and we have to use it to predict an attribute from other attributes, it is a problem of regression. It is not generally used in NLP.
    Eg: Prediction of weather, Prediction of house rent based on house area.
  • Clustering
    When there are no pre-defined labels and all we know is that we have to find similar types of objects, the problem we are dealing with is clustering.
    Eg: Clustering the similar kind of reviews or posts.
  • Supervised Learning
    A learning algorithm which learns from previously labeled data; more precisely, it trains a model, which we will talk about in the next post. Learning algorithms can be of various types: statistical, probability based, simple if-else or something else. Some of the algorithms are Naive Bayes, Decision Trees, K Nearest Neighbors, Support Vector Machines etc. Supervised learning is generally used when we have a huge or considerable amount of labeled data.
    Eg: Part of Speech Tagging, Severity Prediction for the Bug Data.
  • Unsupervised Learning
    Algorithms which train the model when there are no labels available. These algorithms are generally created by using an “intuition” or heuristic, which is nothing but an observation about the data. Some of the algorithms are certain Artificial Neural Networks (such as self-organizing maps), K-Means, Single Linkage etc. The general field where it is used is clustering.
    Eg: Clustering of news articles by their categories, Clustering of users by their data.
  • Semi-Supervised Learning
    This includes both supervised and unsupervised learning. It is generally used on data of which some is labeled but most is unlabeled. It learns from both the labeled data and the data that gets labeled through unsupervised learning.
  • Pre-processing
    The task of processing the data set prior to using it for learning the language models. What we basically do in this part of our program is modify the data set to convert it into a corpus, that is, data which we can feed to the algorithms. The algorithms are generic, meaning they work for various kinds of data sets, hence the need to convert our data so that we can use them. Below are the methods which are generally used in NLP pre-processing.
  • N-Gram
    A gram is a unit to measure quantity. In the case of NLP, a gram can be associated with both words and characters.
    A character N-Gram means considering ‘n’ consecutive characters at once, and the same goes for words. The most common cases are Uni-Gram, a single unit, and Bi-Gram, two units.
    Eg: I live in New York.
    Uni-Gram => [‘I’, ‘live’, ‘in’, ‘New’, ‘York’]
    Bi-Gram => [‘I live’, ‘live in’, ‘in New’, ‘New York’]
  • Tokenization
    We break the text into tokens, which in general are groups of words of a certain length. In simple words, we extract the words, or the phrases of a certain length, that are possible from the sentence or text. These extracted phrases or words are called tokens.
  • Stop Words Removal
    Not all words in a language carry actual meaning; that is, there are words without which we do not lose the true meaning, or at least the gist.
    Eg: Conjunctions, prepositions and other common words: ‘they’, ‘then’, ‘under’, ‘over’, ‘is’, ‘an’, ‘and’ etc. You can find a list on the MySQL website.
    Note: List of stop words depends on the context of the problem.
  • Stemming
    Reducing a word to its root word is called stemming. It is the same notion of a root word that we study while learning about words.
    Eg: ‘education’, ‘educate’, ‘educating’ have the root word ‘educate’.
  • Features
    The tokens which have gone through all the required pre-processing and are ready to be fed to our algorithm are called features. Features are the units which we have extracted from the data set and supply to our algorithm so that it can learn. The difference between tokens and features is that tokens are called features once they are ready for the algorithm.
    Eg: Set of words extracted from data set which we can pass to classifier.
  • Features Extraction
    Extraction of features from the tokens. There are various methods to do so, which we will read about later.
  • Features Selection
    Selection of features from the extracted tokens. Not all tokens are useful, and not all kinds of data are useful. We have to decide which ones will be good to play with.
  • Training and Testing
    There are two parts. The first is training, where we feed the features to the algorithm and it trains, or let’s say creates, a model for those features. The second part is testing, where we use the model the algorithm has created to test how well it is doing.
  • Evaluation
    There are various methods to evaluate a model. Some common evaluation measures are Accuracy, Precision, Recall and F1-Score.
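
The pre-processing terms above (tokenization, N-Grams and stop words removal) can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names are my own, the tokenizer just splits on whitespace and strips punctuation, and the stop word list is a tiny made-up one rather than a real list.

```python
# Tiny illustrative stop word list; real lists are longer and context dependent.
STOP_WORDS = {"i", "in", "is", "an", "and", "they", "then", "under", "over"}
PUNCTUATION = ".,!?;:\"'"


def tokenize(text):
    """Split text on whitespace and strip surrounding punctuation."""
    stripped = (word.strip(PUNCTUATION) for word in text.split())
    return [word for word in stripped if word]


def ngrams(tokens, n):
    """Return word N-Grams: every run of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = tokenize("I live in New York.")
print(ngrams(tokens, 1))          # Uni-Grams
print(ngrams(tokens, 2))          # Bi-Grams
print(remove_stop_words(tokens))  # tokens left after stop words removal
```

What remains after these steps is what would be handed to feature extraction and selection.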

Note: Not all of the above may actually be required in every NLP problem, but these are some common terms which every NLP enthusiast should be familiar with. There is also a chance of inaccuracy in the definitions, as I tried to explain things in the easiest way possible.
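
As a small worked example of the evaluation measures named above, here is a sketch in Python that computes them for a two-label problem. The function name, the labels and the sample predictions are made up for illustration; libraries provide these measures ready-made.

```python
def evaluate(actual, predicted, positive):
    """Compute accuracy, precision, recall and F1 for one 'positive' label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up sentiment labels for four movie reviews.
actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "pos"]
print(evaluate(actual, predicted, positive="pos"))
```

Precision asks how many of the predicted positives were right, recall asks how many of the actual positives were found, and F1 is their harmonic mean.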

Story of Memcached

One MemcacheClass to rule them all, One MemcacheClass to find them,
One MemcacheClass to bring them all and in the RAMs bind them.

Below is a story I found about memcached. It is an amazing story of how memcached works.

Two plucky adventurers, Programmer and Sysadmin, set out on a journey. Together they make websites. Websites with webservers and databases. Users from all over the Internet talk to the webservers and ask them to make pages for them. The webservers ask the databases for junk they need to make the pages. Programmer codes, Sysadmin adds webservers and database servers.

One day the Sysadmin realizes that their database is sick! It’s spewing bile and red stuff all over! Sysadmin declares it has a fever, a load average of 20! Programmer asks Sysadmin, “well, what can we do?” Sysadmin says, “I heard about this great thing called memcached. It really helped livejournal!” “Okay, let’s try it!” says the Programmer.

Our plucky Sysadmin eyes his webservers, of which he has six. He decides to use three of them to run the ‘memcached’ server. Sysadmin adds a gigabyte of ram to each webserver, and starts up memcached with a limit of 1 gigabyte each. So he has three memcached instances, each can hold up to 1 gigabyte of data. So the Programmer and the Sysadmin step back and behold their glorious memcached!

“So now what?” they say, “it’s not DOING anything!” The memcacheds aren’t talking to anything and they certainly don’t have any data. And NOW their database has a load of 25!

Our adventurous Programmer grabs the pecl/memcache client library manual, which the plucky Sysadmin has helpfully installed on all SIX webservers. “Never fear!” he says. “I’ve got an idea!” He takes the IP addresses and port numbers of the THREE memcacheds and adds them to an array in php.

$MEMCACHE_SERVERS = array(
    "10.1.1.1", //web1
    "10.1.1.2", //web2
    "10.1.1.3", //web3
);

Then he makes an object, which he cleverly calls ‘$memcache’.

$memcache = new Memcache();
foreach($MEMCACHE_SERVERS as $server){
    $memcache->addServer ( $server );
}

Now Programmer thinks. He thinks and thinks and thinks. “I know!” he says. “There’s this thing on the front page that runs SELECT * FROM hugetable WHERE timestamp > lastweek ORDER BY timestamp ASC LIMIT 50000; and it takes five seconds!” “Let’s put it in memcached,” he says. So he wraps his code for the SELECT and uses his $memcache object. His code asks:

Are the results of this select in memcache? If not, run the query, take the results, and PUT it in memcache! Like so:

// ask memcached first; get() returns false on a miss
$huge_data_for_front_page = $memcache->get("huge_data_for_front_page");
if($huge_data_for_front_page === false){
    $huge_data_for_front_page = array();
    $sql = "SELECT * FROM hugetable WHERE timestamp > lastweek ORDER BY timestamp ASC LIMIT 50000";
    $res = mysql_query($sql, $mysql_connection);
    while($rec = mysql_fetch_assoc($res)){
        $huge_data_for_front_page[] = $rec;
    }
    // cache for 10 minutes
    $memcache->set("huge_data_for_front_page", $huge_data_for_front_page, 0, 600);
}

// use $huge_data_for_front_page how you please

Programmer pushes code. Sysadmin sweats. BAM! DB load is down to 10! The website is pretty fast now. So now, the Sysadmin puzzles, “What the HELL just happened!?” “I put graphs on my memcacheds! I used cacti, and this is what I see! I see traffic to one memcached, but I made three :(.” So, the Sysadmin quickly learns the ascii protocol and telnets to port 11211 on each memcached and asks it:

Hey, ‘get huge_data_for_front_page’ are you there?

The first memcached does not answer…

The second memcached does not answer…

The third memcached, however, spits back a huge glob of crap into his telnet session! There’s the data! Only one memcached has the key that the Programmer cached!

Puzzled, he asks on the mailing list. They all respond in unison, “It’s a distributed cache! That’s what it does!” But what does that mean? Still confused, and a little scared for his life, the Sysadmin asks the Programmer to cache a few more things. “Let’s see what happens. We’re curious folk. We can figure this one out,” says the Sysadmin.

“Well, there is another query that is not slow, but is run 100 times per second. Maybe that would help,” says the Programmer. So he wraps that up like he did before. Sure enough, the server load drops to 8!

So the Programmer codes more and more things get cached. He uses new techniques. “I found them on the list and the faq! What nice blokes,” he says. The DB load drops; 7, 5, 3, 2, 1!

“Okay,” says the Sysadmin, “let’s try again.” Now he looks at the graphs. ALL of the memcacheds are running! All of them are getting requests! This is great! They’re all used!

So again, he takes keys that the Programmer uses and looks for them on his memcached servers. ‘get this_key’ ‘get that_key’ But each time he does this, he only finds each key on one memcached! Now WHY would you do this, he thinks? And he puzzles all night. That’s silly! Don’t you want the keys to be on all memcacheds?

“But wait”, he thinks “I gave each memcached 1 gigabyte of memory, and that means, in total, I can cache three gigabytes of my database, instead of just ONE! Oh man, this is great,” he thinks. “This’ll save me a ton of cash. Brad Fitzpatrick, I love your ass!”

“But hmm, the next problem, and this one’s a puzzler, this webserver right here, this one running memcached, it’s old, it’s sick and needs to be upgraded. But in order to do that I have to take it offline! What will happen to my poor memcache cluster? Eh, let’s find out,” he says, and he shuts down the box. Now he looks at his graphs. “Oh noes, the DB load, it’s gone up in stride! The load isn’t one, it’s now two. Hmm, but still tolerable. All of the other memcacheds are still getting traffic. This ain’t so bad. Just a few cache misses, and I’m almost done with my work.” So he turns the machine back on, and puts memcached back to work. After a few minutes, the DB load drops again back down to 1, where it should always be.

“The cache restored itself! I get it now. If it’s not available it just means a few of my requests get missed. But it’s not enough to kill me. That’s pretty sweet.”

So, the Programmer and Sysadmin continue to build websites. They continue to cache. When they have questions, they ask the mailing list or read the faq again. They watch their graphs. And all live happily ever after.

Author: Dormando via IRC. Edited by Brian Moon for fun. Further fun editing by Emufarmers.

This story has been illustrated by the online comic TOBlender.com.

Chinese translation by Wei Liu.

Source: http://code.google.com/p/memcached/wiki/TutorialCachingStory
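
The two ideas in the story, checking the cache before running the query and each key living on exactly one memcached, can be sketched as a toy model in Python. This is not how the pecl/memcache client is implemented: the class and function names are my own, plain dictionaries stand in for the memcached servers, the key-to-server mapping is assumed to be a CRC32 hash modulo the server count, and expiry is left out.

```python
import zlib


class ToyCacheCluster:
    """Dictionaries standing in for memcached servers (toy model only)."""

    def __init__(self, server_names):
        self.names = list(server_names)
        self.servers = {name: {} for name in self.names}

    def _server_for(self, key):
        # Assumed mapping: hash the key and pick one server from the list.
        return self.names[zlib.crc32(key.encode("utf-8")) % len(self.names)]

    def get(self, key):
        return self.servers[self._server_for(key)].get(key)

    def set(self, key, value):
        self.servers[self._server_for(key)][key] = value


def cached(cluster, key, compute):
    """Return the cached value, computing and storing it on a miss."""
    value = cluster.get(key)
    if value is None:
        value = compute()  # e.g. the slow SELECT in the story
        cluster.set(key, value)
    return value


cluster = ToyCacheCluster(["10.1.1.1", "10.1.1.2", "10.1.1.3"])
rows = cached(cluster, "huge_data_for_front_page", lambda: ["row1", "row2"])
holders = [name for name, store in cluster.servers.items()
           if "huge_data_for_front_page" in store]
print(rows, holders)  # the key is stored on exactly one server
```

Because every client hashes the key the same way, they all agree on which server to ask, which is why the three 1-gigabyte instances add up to three gigabytes of cache.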

Algorithm – N Kings

N Kings

You have to place N kings on an N×N chess board so that no two kings are in the same row or column and no two kings attack each other.

About the input: the first line is the number of test cases T. Each test case then takes the next 2 lines. The first line contains N, the size of the chess board, and K, the number of kings already in place, starting from the first row. The next line has K numbers: the number pos[i] in position i means that a king is already placed in row i and column pos[i]. 0 <= K <= N. 0 <= pos[i] <= N-1

About the output: the output is the number of possible ways the kings can be placed, modulo 1000000007.

Input:
2
3 0

4 1
2

Output:
0
1

 

Algorithm

I solved the problem recursively by setting up the board and, for each row, checking every column to see whether the position is valid. You can see the solution as a Depth First Search, as we traverse all the way down to a leaf (a solution) first. The algorithm uses Constraint Programming by default, as the first row after the already placed pieces is always the most constrained one, that is, it has the fewest possible positions to start with.

 

Code

You can see the code here. It’s in Python (2.6.2).
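
As a minimal sketch of the recursive approach described under Algorithm (not the linked code itself), here is a brute-force version in Python 3. The function name count_kings is my own, input parsing is left out, and the search is exponential, so it is only meant for small boards.

```python
MOD = 1000000007


def count_kings(n, placed):
    """Count ways to finish the board; placed[i] is the king's column in row i."""
    used = [False] * n
    prev = None
    # Validate the kings that are already on the board.
    for col in placed:
        if used[col] or (prev is not None and abs(col - prev) <= 1):
            return 0
        used[col] = True
        prev = col

    def dfs(row, prev_col):
        # Every row filled: we reached a leaf, which is one valid placement.
        if row == n:
            return 1
        total = 0
        for col in range(n):
            if used[col]:
                continue  # column already taken by another king
            if prev_col is not None and abs(col - prev_col) <= 1:
                continue  # attacked by the king in the previous row
            used[col] = True
            total += dfs(row + 1, col)
            used[col] = False
        return total % MOD

    return dfs(len(placed), prev)


print(count_kings(3, []))   # first sample test case
print(count_kings(4, [2]))  # second sample test case
```

Since there is exactly one king per row and per column, a king can only be attacked by the kings in the rows directly above and below it, so checking against the previous row is enough.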

Spidey: Python Web Crawler

I created a web crawler using Python and its modules. It follows certain conditions; for example, it reads robots.txt before crawling a page. If robots.txt allows the page to be crawled, Spidey crawls it. It dives in recursively. But there are certain limitations I have set. It does not go beyond 20 pages, as it is just a prototype. It cannot detect traps, where it would go on infinitely.

Spidey is a very basic crawler which works just fine, at least on the websites I have tested. I have tested it with http://python.org/ and http://stackoverflow.com/ and it did pretty well.

The modules I have used for the purpose are urllib2, re, BeautifulSoup, robotparser.

Here is the code if you want to test/use it.
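
For a rough idea of the flow described above (this is not Spidey’s actual code), here is a minimal sketch in Python 3. It uses the standard library’s urllib.robotparser and html.parser in place of robotparser and BeautifulSoup, and it takes the page fetcher and the robots.txt lines as parameters so it can be tried without touching the network. The function names, the user agent string and the example.test pages are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

MAX_PAGES = 20  # same prototype limit as described above


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_lines, agent="spidey", max_pages=MAX_PAGES):
    """Recursively crawl from start_url, honoring robots.txt and the page limit."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    visited = []

    def visit(url):
        if url in visited or len(visited) >= max_pages:
            return
        if not robots.can_fetch(agent, url):
            return  # robots.txt does not allow this page
        visited.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            visit(urljoin(url, href))  # dive in recursively

    visit(start_url)
    return visited


# A made-up three-page site instead of a real HTTP fetch.
site = {
    "http://example.test/": '<a href="/a">a</a> <a href="/private/x">x</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/private/x": "",
}
robots_txt = ["User-agent: *", "Disallow: /private/"]
print(crawl("http://example.test/", lambda url: site.get(url, ""), robots_txt))
```

Remembering the visited pages keeps it from looping between pages that link to each other, but like the prototype it has no real defence against crawler traps beyond the page limit.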