Waiting for Data

Should I Wait for the Data?

An organization without data science often realizes it should be using data science, but how to start becomes the issue. There are essentially two schools of thought when it comes to building a data science team:

  • We will start data science once we get enough data.
  • We will build data science and the data will come.

Data scientists are often not involved in this decision because the organization has not hired any yet, so the decision tends to be made without data science expertise.

Wait for the Data

Generally speaking, waiting is not good. The sooner you can build capability the better, and the best way to learn is by doing. Many data science initiatives are ambitious and have numerous dependencies that are only vaguely understood. So while you may have a strong plan for the final data science product, the details that go into creating it are what end up causing significant problems. The gotchas of waiting for data can include:

  • The level of effort to pipe, clean, and format the data to a data science platform is much greater than anticipated.
  • The team tooling is not set up and tested, causing significant delays as the organization learns how to set up the new tools and access needed for data science.
  • Missed opportunity to test assumptions on the system you will be conducting data science on.
  • Staffing the team - the ratio of data scientists to data engineers can be tricky to balance.

The sample data sets prepared to propose a new data science initiative often do not adequately represent the level of effort it will take to feed the model reliably. A one-off demonstration only shows that the data exists; it does not outline the infrastructure and code development required to pipe that data to your model consistently. Your team will also need resources nobody predicted, such as additional compute capacity, new databases, orchestration tooling, streaming capability, monitoring, and much more. The sooner you get in there, the sooner you will learn what you need.

Data science can be useful in almost all organizations, but every organization controls access to IT differently. I have tried to help numerous new data science teams that could not install programming languages on their computers for security reasons, were denied access to the data they needed because they were not in the IT department, or lacked the system administration experience to install the programs and packages they needed. In all of these cases, the data science teams were delayed significantly and the organization had to make significant changes to enable them. In one case, the organization was not able to make the changes fast enough and the data science team was dissolved. The sooner you can identify these issues and work toward a solution, the better.

Everyone wants to think their data science projects will be built successfully and drive significant change within the organization, but before you can do this you must validate that the data supports your assertions with statistical significance. It is said that Amazon’s first book recommendation model required a customer to view over 20 books before it could generate book recommendations. Stop and consider this. How many websites have people stay on the site long enough to track over 20 events? And if they do have 20 events, are those events similar enough to predict anything? So for a proposed recommendation engine, you would want to assess how many users visit the site and, of these users, how many click and spend enough time on the site to produce data that is useful for building a model. Or for industrial applications, do the sensors of interest show any type of indicator for the event you are interested in? The available data and its statistical significance go a long way toward determining how likely your future model is to work. By finding this out early, you can start making changes to operations in order to get the data you need. (Or it gives you time to buy third-party data!)
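As a rough illustration of the kind of early check described above, here is a minimal Python sketch that counts how many users in an event log have enough tracked events to be useful. The function name, the (user_id, item_id) log format, and the sample data are all hypothetical, and the threshold of 20 is borrowed from the Amazon anecdote rather than being a recommended value. It says nothing about whether the events are similar enough to predict anything; it only answers the first question of how many users you actually have to work with.

```python
from collections import Counter


def users_with_enough_events(events, min_events=20):
    """Return (eligible_users, total_users, share) for an event log.

    `events` is an iterable of (user_id, item_id) pairs, one per tracked
    event. This log format is an assumption made for the example.
    """
    # Count how many events each user generated.
    counts = Counter(user_id for user_id, _ in events)
    total = len(counts)
    # A user is "eligible" if they reached the minimum event count.
    eligible = sum(1 for n in counts.values() if n >= min_events)
    share = eligible / total if total else 0.0
    return eligible, total, share


# Hypothetical toy log: one engaged user and two casual visitors.
sample_events = (
    [("u1", f"book-{i}") for i in range(25)]
    + [("u2", f"book-{i}") for i in range(3)]
    + [("u3", "book-0")]
)

eligible, total, share = users_with_enough_events(sample_events, min_events=20)
print(f"{eligible} of {total} users have at least 20 events ({share:.0%})")
# prints: 1 of 3 users have at least 20 events (33%)
```

If the share of eligible users on your real data is small, that is an early signal to change what you collect, or to look at third-party data, before committing to the model.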

Staffing the team is always tricky, and the right mix varies with the type of data science, the company, and the industry. For large website data science teams, I generally planned for three data engineers for every one data scientist. We spent much more time moving, cleaning, and formatting data than modeling it, and if there were not enough data engineers the data scientists would run out of data to work with. The right team composition also depends on the first gotcha above: how much effort it takes to pipe, clean, and format the data. Another important factor to consider is the type of work you will be doing. If the infrastructure is already developed and you have the support of other teams or departments, then you can focus on data engineering and data science, but if not, you may need other skill sets on your team.

Build It and They Will Come

Building first will address many of the issues discussed in the “Wait for the Data” section, but there are still reasons to wait, such as:

  • Your team is experienced in data science and has deployed projects in the past.
  • Organizational change management is needed for data science to have a place.

If you have an experienced team that is comfortable with all the details that go into establishing data science projects, it often makes more sense to deploy them on projects with more immediate gains. A data science team is expensive, and the more they can engage and show results the better. Plus, the level of engagement and interest is an important factor to consider when managing these types of teams. The members of data science teams are smart, motivated, and in high demand, so their needs and interests have to be taken into account. It is not uncommon to lose teams because the challenges of the organization delay or freeze data science projects and the team members seek opportunities elsewhere. But if you have numerous data science initiatives and confidence in your team, waiting for data while keeping your team engaged in relevant projects elsewhere may be a good decision.

Other times you will have to wait for the data because your focus will be on the management side of data science. Getting the senior management team to endorse a data science initiative is one thing, but then you must work it through all levels of the organization. This can be very challenging and take a considerable amount of time and patience. It will also likely be a challenge to get proper access to the data you need, admin-level access in the areas of the network and infrastructure that your team will work in, explanations of how the data is used and analyzed, and a way to tie your model outputs into the production side of the system. There will be many “campaigns” to get what your team requires, and it can be an uphill battle that will really hurt morale if the team feels every loss in those management campaigns. If this type of change management is going to be a significant element of your data science initiative, you should definitely consider waiting until you get to a point where the team can work with some level of stability.

Deciding - Technical vs. Organizational Challenges

There is no single right answer for how to start a data science team, and it is not guaranteed that a data science team will be successful. As a champion for data science, you will have to balance both the organizational and technical demands. The experience of your data science team, senior management’s general understanding of data science, availability of resources and funding, quality of data, and type of industry will all be important factors that affect the outcome. You may throw up your hands in frustration at this point and say “I don’t know data science. How can I navigate all of these details?” To that, I would say shift your focus. You are a manager, and you need to make sure you understand the needs of your data scientists and your organization, so learn how to get the things that your data science team needs. Starting a data science team is about building and showing results, so find low-hanging fruit and promote it. If other departments show interest, consider helping them with data science and start building organic support from within. The first phase of data science may not involve the highly advanced models we read about on Medium, but it will help everyone learn how to conduct data science and, more importantly, how to use data science.