© 2017 by Doran Bae 

You are reading Coding Otter, stuffed with articles about big data, data science development, ML in production, managing data science projects, and more.
About the host
I'm Doran Bae, Data Scientist @TVB turning data into products and stories.

Five essential tips on leading your first data science project

These are the five essential tips I can offer to anyone about to lead a young data science team. Please keep in mind that I come from a non-CS background, so this piece may seem elementary to some readers. However, from what I have observed in the industry, people like me lead a large share of data science projects! So if you are in the same situation as I am, maybe my tips can help you make fewer mistakes than I did.

1. Learn coding etiquette

My first data science team was made up of engineers and analysts from many different backgrounds. Some held a bachelor's degree in Computer Science, one was a fresh graduate with no Linux experience, and some (like myself) were self-taught programmers. I am sure this is the case for many other data science teams out there. A solid CS background is generally preferred, but not all data scientists come equipped with basic CS knowledge. This creates small problems that can eventually slow the team down.

It is very important that everyone on the team learns (or is reminded of) and practices coding etiquette. I am talking about variable naming conventions, documentation, coherent file names, and so on. This applies to programmers and non-programmers alike. Even if you are a rock-star programmer, failing to follow the simple rules set within your team eventually breaks the teamwork.

I learned this the hard way. In the beginning, we were busy building the product, and I could not imagine that a single badly named variable could hurt our model. Several months after the launch, we had to change our data source (believe me, data pipeline disruptions happen more often than you would hope!), which meant we had to go back into our old code and amend it. We ended up spending weeks on this task, and even afterwards we had to deal with occasional bugs popping up here and there.

This is only one small example of the many things that can go wrong when some of us do not follow coding etiquette. If you don't know it, you need to learn it as soon as possible.
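To make the naming point concrete, here is a minimal Python sketch. It is not code from our project: the function names, the field names (`clicks`, `impressions`), and the data layout are all hypothetical, chosen only to show how a cryptic version and a readable version of the same logic compare.

```python
# Hypothetical example: every name and field here is illustrative only.

# Hard to maintain: what are d and x, and what do indexes 2 and 3 mean?
def f(d):
    return [x[3] / x[2] for x in d if x[2] > 0]


# Easier to maintain: the field names live in one place, so a change of
# data source only touches these two constants.
CLICKS_FIELD = "clicks"
IMPRESSIONS_FIELD = "impressions"


def click_through_rates(rows):
    """Return clicks / impressions for every row with at least one impression."""
    return [
        row[CLICKS_FIELD] / row[IMPRESSIONS_FIELD]
        for row in rows
        if row[IMPRESSIONS_FIELD] > 0
    ]
```

Both functions compute the same ratio, but only the second one tells a teammate what it does without opening the data file. If the new data source renames a field, the fix is a one-line change to a constant rather than a hunt through index numbers.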

2. Do code review (or at least, review the process together)

Fear not... (credit: MindBowser)

One of my friends, who is also a data scientist, once told me that it is very difficult to find an error in a data science project. I could not agree more. When data scientists build deep learning models, we are often unsure of what is going on under the hood. That is kind of the whole point of a neural network. It is very easy to make small mistakes and not notice them until much later. As long as the result looks good enough, there is little motivation to do a thorough review of the modeling code.

If you feel that your coding skills are not as good as those of your fellow engineers, initiating a code review is especially intimidating. I strongly encourage you to do the code review anyway, because all work should be reviewed by at least one person other than the author. If this is too much of a problem, at least review the process together. Have the engineer walk through the process step by step and explain it to you. If there is an error, you will be able to spot it by following the logic of the model, even without reading the code line by line.

3. Do expectation management

Expectations are tricky to manage. Expectations that are too high are problematic, but expectations that are too low are discouraging. In my experience, they come in pairs. At the very beginning of project scoping, the PM (or whoever makes the decisions) expects you to create something magical. When it is about time to deploy to the front end, the PM suddenly gets cold feet and often says they would like to hold off the launch until the model is further proven.

This is discouraging not only to you but also to your team members. It is essential to anticipate that the legacy system will be hard to get rid of and that your machine learning models may not get deployed. Manage expectations outward as well as inward.

4. Simple is best

You may be tempted to build a state-of-the-art model, but I strongly advise you to keep it simple in the beginning. Of course, this all depends on the circumstances. In my case, there were many hoops to jump through before a machine learning model could be deployed: network instability, no way to capture machine learning feedback in real time, no data, resistance from other teams, and did I mention no data?

Deploying a machine learning model to the client side is like building a house. When you want to build a house, you need to buy the land first. Buying the right land, in the right size, in the right neighborhood, and at the right price is the key. Once you have the land, you can think about building something on it. In many cases, it is difficult to deploy a new module inside an existing system. What you need to do in the beginning is buy the land, that is, claim a screen on the platform. Start simple with easy algorithms such as ranking rules. Once you own a part of the platform, it becomes easier to build something on top of it.
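As an illustration of what "easy algorithms such as ranking rules" can mean, here is a minimal Python sketch of a popularity ranking. It is not the rule we actually shipped: the function name `rank_by_popularity`, the `view_events` input, and the `top_n` parameter are assumptions made up for this example.

```python
from collections import Counter


def rank_by_popularity(view_events, top_n=3):
    """Return the top_n most viewed item ids, most viewed first.

    view_events is assumed to be a list of item ids, one entry per view.
    Items with equal counts keep the order in which they were first seen.
    """
    counts = Counter(view_events)
    return [item for item, _ in counts.most_common(top_n)]
```

For example, `rank_by_popularity(["a", "b", "a", "c", "b", "a"], top_n=2)` returns `["a", "b"]`. A rule like this needs no training data pipeline and no real-time feedback loop, which is exactly why it can claim a screen long before a learned model is ready to replace it.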

Start simple, show results, and then improve the model step by step.

5. Document everything

"What will I do if you get hit by a bus tomorrow?" As harsh as it sounds, this is what I tell my teammates when I want them to write documentation. No engineer (as a matter of fact, none of us!) likes to document what they did. I admit it: it is boring and tedious. Plus, it never gets updated.

You need to require your engineers to document all of their work. If they don't know how to do it, set up some simple templates they can follow. Some engineers may not be creative, but all of them certainly know how to follow rules and guidelines if they are enforced.
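A template does not have to be elaborate. As one possible starting point, here is a sketch of a docstring template in Python. The headings (Purpose, Inputs, Output, Assumptions) and the function `filter_active_users` are my own illustrative choices, not a standard and not our team's actual template.

```python
def filter_active_users(view_counts, min_views=5):
    """Keep users with at least `min_views` views.

    Purpose:     Drop low-activity users before training.
    Inputs:      view_counts: dict mapping user id to number of views.
                 min_views: inclusive activity threshold.
    Output:      dict of the same shape, containing only active users.
    Assumptions: view counts are non-negative integers.
    """
    return {
        user: views
        for user, views in view_counts.items()
        if views >= min_views
    }
```

The point is that the engineer only has to fill in four fixed headings. Whoever inherits the code can then read the answers with `help(filter_active_users)` instead of asking the person who wrote it.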