© 2017 by Doran Bae 

You are reading Coding Otter, stuffed with articles about big data, data science development, ML in production, managing data science projects, and much more.
About the host
I'm Doran Bae, Data Scientist @TVB turning data into products and stories.

What we do as a data science team

The field of data science is vast, and people often fail to see that there is a lot more to it than, say, inventing an artificial intelligence such as Jarvis. Let me take that back: most data science teams don't make Jarvis. I don't think it is even possible with today's technology to make something as complicated as Jarvis. In my experience, the most exciting parts of a data scientist's day are:

  • When I manage to set up a TensorFlow Serving server (don't ask me to do this again)

  • When I understand the difference between 'active' and 'inactive' in the 'user_status' column

  • When the machine learning accuracy is not that bad (anything above the baseline is sufficient)

  • When the boss is on leave (the happiest moment, indeed)

Jokes aside, allow me to introduce what is expected from a data science team in a conventional company. I may be barely scratching the surface, but then again, no one seems to have scratched the surface of this matter in any detail, so I am just going to do that and be happy.

As a disclaimer, what I am about to tell you may not apply to every data science team. In particular, it may not apply to IT companies that are leading cutting-edge technology innovation. It is more likely to apply to companies that are consumers of big data technology and infrastructure.

1. Glorified data analysts

Unless your company is a startup founded in the last five years, chances are that your data science team is not that old. Data science teams all start somewhere, usually as part of the marketing or analytics department. Moreover, they work on what used to be data analysts' tasks, with a flair of big data techniques.

You will be asked to build dashboards using popular big data visualisation tools such as Google Analytics or Tableau. You may also be asked to do in-depth analyses of subjects that your company cares about: user segmentation (using clustering), user behavior analysis using time series, or profit optimisation strategy using Bayes' theorem. You will be required to know at least SQL and some kind of data analysis package or language: R, SAS, or Python. I believe these tasks are all very important, as no decision can be made unless we understand what the data is telling us.
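
To make the user segmentation task a little more concrete, here is a minimal sketch of clustering in plain Python. It is only an illustration: the two features (sessions per week, average minutes per session), the sample users, and the kmeans helper are all made up for this example, and a real project would more likely reach for an established library than hand-rolled k-means.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups with plain Lloyd's k-means."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical users: (sessions per week, average minutes per session).
users = [(1, 5), (2, 4), (1, 6), (9, 30), (10, 28), (8, 35)]
labels, centroids = kmeans(users, k=2)
print(labels)      # one cluster label per user
print(centroids)   # the "typical" user of each segment
```

The light users and the heavy users end up in separate segments; which segment gets label 0 depends on the random starting points.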

2. Build & ship data products

This is especially true if you work in a consumer-facing industry with some platform or product. If your team is required to deliver something to the front-end, then chances are that your team is made up of full-stack data scientists who can do everything from A to Z. This is the case in my current job. You will work with product managers (PMs) to define requirements, project scopes, and deadlines (one soft deadline and one hard deadline).

In these kinds of tasks, the data science part is over rather quickly, and you will spend more time perfecting data pipelines and APIs, ensuring data capture, and, most importantly, doing load tests and QA. This requires a lot of CS skills (at the very least, you need to understand what is going on).

3. Data cleaning, cleaning, and cleaning

This is where I believe most data science teams spend most of their time: data migration and data cleansing. I spend a lot of my time digging through data to determine whether it is correct and reliable. We spent two months migrating our data source from one solution to another. We have built numerous APIs just to check the data's reliability in real time. We accumulate updates for some data tables because we are afraid the data may not arrive tomorrow.

These tasks are very important, and they come before everything else. Unfortunately, data cleaning never ends. There is always more data to be cleaned (or thrown away).
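
As a rough idea of what a reliability check can look like, here is a small sketch in plain Python. The check_rows function, the user_id field, and the sample rows are hypothetical; only the 'user_status' column and its active/inactive values come from the joke earlier in this post. It is not the code we run in production, just the general shape of the idea.

```python
def check_rows(rows, required, allowed_status=("active", "inactive")):
    """Return a list of human-readable problems found in rows."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Every required field must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        # A status that is present must be one of the known values.
        status = row.get("user_status")
        if status not in (None, "") and status not in allowed_status:
            problems.append(f"row {i}: unexpected user_status {status!r}")
        # The same user should not show up twice in one batch.
        user_id = row.get("user_id")
        if user_id in seen_ids:
            problems.append(f"row {i}: duplicate user_id {user_id}")
        elif user_id is not None:
            seen_ids.add(user_id)
    return problems


# Made-up batch with one bad status, one duplicate, and one missing value.
rows = [
    {"user_id": 1, "user_status": "active"},
    {"user_id": 2, "user_status": "Active"},
    {"user_id": 1, "user_status": "inactive"},
    {"user_id": 3, "user_status": None},
]
problems = check_rows(rows, required=("user_id", "user_status"))
for problem in problems:
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to see how dirty a whole batch is before deciding whether to clean it or throw it away.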

4. Research

This is what I want to spend more time on, but do not get to because of 1, 2, and 3 above. I'd like to devote a portion of my team's resources to research projects. These projects may not be directly related to the company's profit, but in the long term they are expected to improve the team's understanding of data.