© 2017 by Doran Bae 

You are reading Coding Otter, stuffed with articles about big data, data science development, ML in production, managing data science projects, and much more.
About the host
I'm Doran Bae, Data Scientist @TVB turning data into products and stories.

Programming etiquette for data scientists

Programming is one of the essential skills a data scientist needs. Because demand for data scientists exceeds supply, a good portion of them taught themselves to program while training as data scientists. I am one of them.

The problem with self-taught programming skills is that, while you may be street smart, you may lack fundamentals that are usually taught in a CS101 classroom. Here is a list of things I have learned over the years.

1. Write your code so that it is maintainable (by others)

My team once had a teammate whom we would describe as a brilliant jerk. That teammate was innovative, punctual, and had otherwise wonderful qualities, except for not caring about code maintenance. After the teammate left our team, we opened the teammate's code files and, to our disbelief, found this:

for _ in __:
    random_func_1(_)

Everyone stared at the code and scratched their heads, because none of us could tell what it was supposed to do. After struggling with it for a few days, we gave up and rewrote the entire module. This time, we documented it heavily (perhaps a bit too heavily).

It is very important to write your code so that others can maintain it. When the code lives only on your laptop and you are the only one who will ever look at it, formatting can be a bit relaxed. However, if there is any chance your code will be shared with others, it is always a good idea to rewrite your messy work into a more presentable form. This includes:

  • Use meaningful variable names

  • Use meaningful file names and keep them consistent across your team

  • Use comments (better yet, write documentation)

  • Do not achieve conciseness at the expense of self-explainability

I want to stress the last point. Even though your goal should always be to write concise and efficient code, you also need to consider how self-explanatory that code is. If you must use a terse form for better performance, be sure to choose variable names that make sense to readers. If that is not possible, leave comments that explain what you are trying to achieve.
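To make the contrast with the cryptic loop above concrete, here is a minimal sketch in Python. The function `normalize_scores` and all of its variable names are hypothetical, chosen only for illustration; they are not taken from any real project. The point is that the names and a short docstring tell the reader what the code is for, without the code becoming much longer.

```python
from typing import Iterable, List


def normalize_scores(raw_scores: Iterable[float]) -> List[float]:
    """Rescale scores to the range [0, 1] using min-max normalization."""
    scores = list(raw_scores)
    if not scores:
        # Nothing to normalize.
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in scores]
    return [(score - lowest) / (highest - lowest) for score in scores]
```

A reader who has never seen this function can guess what `normalize_scores([2.0, 4.0, 6.0])` returns, which is exactly what could not be said about `random_func_1(_)`.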

2. Learn how to use Git (especially how to resolve merge problems)

Using Git for version control is common practice in engineering. A few of us learn this the hard way, only after we somehow delete everyone's work. Committing is the easy part. You also need to learn how to resolve the problems that come up, such as merge conflicts.
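When a merge conflicts, Git marks the disputed region in the file with lines beginning with `<<<<<<<`, `=======`, and `>>>>>>>`, and it is up to you to edit the file and remove them. As a small safety net, the sketch below scans a file's text for leftover markers before you commit. The function name `find_conflict_markers` is my own invention, not part of Git, and the check is deliberately simple: a line consisting only of seven equals signs in an ordinary text file would be reported as well.

```python
from typing import List, Tuple


def find_conflict_markers(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that look like Git conflict markers."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        is_side_marker = line.startswith("<<<<<<< ") or line.startswith(">>>>>>> ")
        is_separator = line.rstrip() == "======="
        if is_side_marker or is_separator:
            hits.append((line_number, line))
    return hits
```

An empty result suggests the file has no unresolved conflicts; anything else deserves a second look before the commit goes out.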

3. Optimize

I gave two similar tasks to two of my engineers. Each task was to build an ETL data pipeline for one of our existing data tables. One was an engineer by training through and through (I will call him E) and the other was a self-taught engineer (I will call him S). After a few weeks, I checked their work and made a surprising discovery.

S's code was much more concise than E's; E's code was about twice as long as S's. Later, I found out why. S had achieved the goal but had not considered the efficiency (or speed) of the task. When we increased the data size, his pipeline broke down because it could not handle the increased volume. E's code was longer, but it was written in a way that ensured stability.

In big data projects, more often than not, you need to work in a dynamic environment. Your code must be able to withstand foreseeable risks. It takes time and experience to really know what you are doing. I strongly recommend studying dynamic programming, big-O notation, and related topics so that you can produce higher-quality results.
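Here is a small illustration of why big-O matters once the data grows. It is not the code from the pipelines above; the function names and the ID-matching task are assumptions made up for this sketch. Both functions return the same answer, but the first scans the whole `existing_ids` list for every new ID, so its running time grows with the product of the two sizes, while the second builds a set once and does a constant-time lookup per ID on average.

```python
from typing import Hashable, List, Sequence


def matching_ids_quadratic(new_ids: Sequence[Hashable],
                           existing_ids: Sequence[Hashable]) -> List[Hashable]:
    # `in` on a list scans it element by element:
    # roughly O(len(new_ids) * len(existing_ids)) comparisons.
    return [new_id for new_id in new_ids if new_id in existing_ids]


def matching_ids_linear(new_ids: Sequence[Hashable],
                        existing_ids: Sequence[Hashable]) -> List[Hashable]:
    # Building the set costs O(len(existing_ids)) once; each lookup is then
    # O(1) on average, so the total is about O(len(new_ids) + len(existing_ids)).
    existing = set(existing_ids)
    return [new_id for new_id in new_ids if new_id in existing]
```

On a few hundred rows you will not notice the difference. On millions of rows, the first version is the kind of code that works in testing and then breaks down in production.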

4. Understand the importance of QA

It can seem almost impossible to do QA for data science projects. How can you tell whether a result is correct? If the result looks sub-optimal, what means do QA engineers have to determine whether the cause is a bad model or a bug?

If you are shipping your data product to the frontend, you must not skip the QA process. First, you need to do engineering QA (whether the module or API works or not). Then you need to conduct what I like to call quality QA. This means manually checking the results to see whether they would make sense to most people. It is almost impossible to check all of them, but at least check a good portion. You may be surprised at how easily you can tune your model through this method.
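The two kinds of QA can be sketched roughly as follows. Everything here is an assumption for illustration: I am pretending the model outputs a list of records with an `item_id` and a numeric `score` between 0 and 1, and the function names are hypothetical. The first function is the engineering check; the second draws a reproducible random sample for a human to read through.

```python
import random
from typing import Dict, List, Sequence


def check_prediction_format(predictions: Sequence[Dict]) -> List[str]:
    """Engineering QA: describe every record that breaks the assumed format."""
    problems = []
    for index, record in enumerate(predictions):
        if "item_id" not in record or "score" not in record:
            problems.append(f"record {index}: missing field")
            continue
        if not 0.0 <= record["score"] <= 1.0:
            problems.append(f"record {index}: score out of range")
    return problems


def sample_for_manual_review(predictions: Sequence[Dict],
                             sample_size: int,
                             seed: int = 0) -> List[Dict]:
    """Quality QA: pick a random subset of results for a person to inspect."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    count = min(sample_size, len(predictions))
    return rng.sample(list(predictions), count)
```

The format check can run automatically on every build. The sample is what you sit down with and ask, record by record, whether the result makes sense.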
