Companies are digitizing and pushing all their operational functions and workflows into IT systems that benefit from so-called ‘big data’ analytics. Using this approach, firms can start to analyze the massive firehose stream of data now being recorded by the Internet of Things (IoT) with its sensors and lasers designed to monitor physical equipment. They can also start to ingest and crunch through the data streams being produced in every corner of what is now a software-driven, data-driven business model.
All well and good, but who is going to do all this work? It looks like your company just had to establish a data engineering department.
Drinking from the data firehose
As a technologist, writer and speaker on software engineering, Aashu Virmani also holds the role of chief marketing officer at in-database analytics software company Fuzzy Logix (known for its DB Lytix product). Virmani claims that there’s gold in them thar data hills, if we know how to get at it. This is the point where firms start to realize that they need to invest in an ever larger army of data engineers and data scientists.
But who are these engineers and scientists? Are they engineers in the traditional sense with greasy spanners and overalls? Are they scientists in the traditional sense with bad hair and too many ballpoint pens in their jacket pockets? Not as such, obviously, because this is IT.
What’s the difference between a data scientist & a data engineer?
“First things first, let’s ensure we understand what the difference between a data scientist and a data engineer really is because, if we know this, then we know how best to direct them to drive value for the business. In the most simple of terms, data engineers worry about data infrastructure while data scientists are all about analysis,” explains Fuzzy Logix’s Virmani.
Boiling it down even more, one prototypes and the other deploys.
Is one role more important than the other? That’s a bit like asking whether a fork is more important than a knife. Both have their purposes and both can operate independently. But in truth, they really come into their own when used together.
What makes a good data scientist?
“They (the data scientist) may not have a ton of programming experience but their understanding of one or more analytics frameworks is essential. Put simply, they need to know which tool to use (and when) from the tool box available to them. Just as critically, they must be able to spot data quality issues because they understand how the algorithms work,” said Virmani.
He asserts that a large part of their role is hypothesis testing (confirming or denying a well-known thesis), but the data scientist who knows their stuff will impartially let the data tell its own story.
Virmani continued, “Visualizing the data is just as important as being a good statistician, so the effective data scientist will have knowledge of some visualization tools and frameworks to, again, help them tell a story with the data. Lastly, the best data scientists have a restless curiosity which compels them to try and fail in the process of knowledge discovery.”
What makes a good data engineer?
To be effective in this role, a data engineer needs to know the database technology. Cold. Teradata, IBM, Oracle and Hadoop are all ‘first base’ for the data engineer you want in your organization.
“In addition to knowing the database technology, the data engineer has an idea of the data schema and organization – how their company’s data is structured, so he or she can put together the right data sets from the right sources for the scientist to explore,” said Virmani.
The data engineer will be utterly comfortable with the ‘pre’ tasks that must happen before data science can even occur and the ‘post’ tasks that follow it. The ‘pre’ tasks mostly deal with what we call ETL: Extract, Transform, Load.
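As a rough illustration of what an ETL ‘pre’ task can look like, here is a minimal sketch in Python. It assumes a tiny, made-up CSV export and uses an in-memory SQLite database as a stand-in for a real warehouse; the table name, column names and cleaning rules are all hypothetical and not taken from Fuzzy Logix or any product named above.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from an operational system.
RAW_CSV = """txn_id,amount,currency
1, 19.99 ,usd
2,,usd
3,250.00,USD
"""

def extract(text):
    """Extract: read rows from the raw source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, then normalize types and casing."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        clean.append((int(row["txn_id"]), float(amount), row["currency"].strip().upper()))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the analytics table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(txn_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real target database
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT txn_id, amount, currency FROM transactions ORDER BY txn_id"
).fetchall()
print(loaded)
```

The three functions mirror the three letters of the acronym, which keeps each stage easy to test and swap out on its own.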
Virmani continued, “Often it may be the case that the data science is happening not in the same platform, but an experimental copy of the database and often in a small subset of the data. It is also frequently the case that IT may own the operational database and may have strict rules on how/when the data can be accessed. A data science team needs a ‘sandbox’ in which to play – either in the same DB environment, or in a new environment intended for data scientists. A data engineer makes that possible. Flawlessly.”
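To make the ‘sandbox’ idea concrete, the sketch below copies a small subset of a production table into a separate database that data scientists can experiment on freely. It is only an assumption-laden illustration: the two in-memory SQLite databases, the `transactions` table and the hypothetical `build_sandbox` helper stand in for whatever platforms and access rules a real IT team would have.

```python
import sqlite3

def build_sandbox(prod, sandbox, sample_every=10):
    """Copy every Nth transaction from production into a separate sandbox database."""
    sandbox.execute("CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, amount REAL)")
    rows = prod.execute(
        "SELECT txn_id, amount FROM transactions WHERE txn_id % ? = 0",
        (sample_every,),
    ).fetchall()
    sandbox.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    sandbox.commit()
    return len(rows)

# Stand-in for the operational database that IT owns.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, amount REAL)")
prod.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 101)],
)
prod.commit()

# Separate environment for the data science team; production is only read from.
sandbox = sqlite3.connect(":memory:")
copied = build_sandbox(prod, sandbox)
print(copied, "rows copied into the sandbox")
```

Because the helper only issues a SELECT against production, the operational data is never modified, which is the kind of guarantee an IT team is likely to ask for.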
Turning to ‘post’ tasks, once the data science happens (say, a predictive model is built that determines ‘which credit card transactions are fraudulent’), the process needs to be ‘operationalized’. This requires that the analytic model developed by the data scientists be moved from the ‘sandbox’ environment to the real production/operational database, or transaction system. The data engineer is the role that can take the output of the data scientist and help put this into production. Without this role, there will be tons of insights (some proven, some unproven) but nothing put into production to see if the model is providing the business value in real time or not.
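One way to picture this ‘post’ task is as a hand-off: the data scientist exports the model’s parameters from the sandbox, and the data engineer registers a scoring function inside the production database so that transactions can be scored where they live. The sketch below assumes a logistic-regression-style fraud model with invented, unfitted coefficients, a hypothetical `make_scorer` helper and an in-memory SQLite database; it does not represent how Fuzzy Logix or any particular vendor does it.

```python
import json
import math
import sqlite3

# Hypothetical hand-off from the sandbox: model coefficients serialized as JSON.
# The values are illustrative only and were not fitted to real data.
model_json = json.dumps({"intercept": -4.0, "amount": 0.01, "foreign": 2.0})

def make_scorer(serialized):
    """Build a scoring function from the serialized model parameters."""
    params = json.loads(serialized)

    def fraud_score(amount, foreign):
        # Logistic function applied to a linear combination of the inputs.
        z = params["intercept"] + params["amount"] * amount + params["foreign"] * foreign
        return 1.0 / (1.0 + math.exp(-z))

    return fraud_score

# Stand-in for the production transaction system.
prod = sqlite3.connect(":memory:")
prod.execute(
    "CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, amount REAL, foreign_card INTEGER)"
)
prod.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 20.0, 0), (2, 900.0, 1)],
)

# 'Operationalize': expose the model as a SQL function inside the database.
prod.create_function("fraud_score", 2, make_scorer(model_json))
flagged = prod.execute(
    "SELECT txn_id FROM transactions WHERE fraud_score(amount, foreign_card) > 0.5"
).fetchall()
print(flagged)
```

Scoring inside the database avoids shipping transaction data out to a separate analytics system, which is the general idea behind in-database analytics.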
OK, so you now understand what ‘good’ looks like in terms of data scientists and engineers, but how do you set them up for success?
How to make your big data team work
The first and most important factor here is creating the right operational structure to allow both parties to work collaboratively and to gain value from each other. Both roles function best when supported by the other, so create the right internal processes to allow this to happen.
Fuzzy Logix’s Virmani warns that we should never let this become a tug of war between the CIO and the CDO (chief data officer), where the CDO’s organization just wants to get on with the analysis/exploration, while the IT team wants to control access to every table/row there is (for what may be valid reasons).
“Next, invest in the right technologies to allow them to maximize their time and to focus in the right areas. For example, our approach at Fuzzy Logix is to embed analytics directly into applications and reporting tools so allowing data scientists to be freed up to work on high value problems,” he said.
Don’t nickel & dime on talent
After speaking to a number of firms in the big data space that are trying to establish big data teams, one final truth appears to resonate: don’t nickel and dime on talent. These roles are (comparatively) new, and if you pay for cheap labor then most likely you’re not going to get data engineering or data science gold.
Fuzzy Logix offers in-database and GPU-based analytics solutions built on libraries of over 600 mathematical, statistical, simulation, data mining, time series and financial models. The firm has an (arguably) non-corporate and (relatively) realistic take on real-world big data operations, and this conversation hopefully sheds some light on the internal mechanics of a department that a lot of firms are now working to establish.