Understanding Data Roles

With the rise of Big Data has come an accompanying explosion in roles that in some way involve data. Most people involved with enterprise technology are at least familiar with them by name, but sometimes it’s helpful to look at them through a comprehensive lens that shows how they all fit together. In understanding how data roles mesh, think about them in terms of two pools: one is responsible for making data ready for use, and the other puts that data to use. The latter includes the tightly woven roles of Data Analyst and Data Scientist, and the former includes such roles as Database Administrator, Data Architect and Data Governance Manager.

Ensuring the data is ready for use

Making Sure the Engine Works. A car is only as good as its engine, and according to PC Magazine the Database Administrator (DBA) is “responsible for the physical design and management of the database and for the evaluation, selection and implementation of the DBMS.” Techopedia defines the position as one that “directs or performs all activities related to maintaining a successful database environment.” A DBA’s responsibilities include security, optimization, monitoring and troubleshooting, and ensuring enough capacity to support activities. This of course requires a high level of technical expertise, particularly in SQL and increasingly in NoSQL. But while the role is technical, TechTarget maintains that it may also involve managerial functions, including “establishing policies and procedures pertaining to the management, security, maintenance, and use of the database management system.”

Directing the Vision. With the database engines in place, the task becomes one of creating an infrastructure for taking in, moving and accessing the data. If the DBA builds the car, then the Enterprise Data Architect (EDA) builds the freeway system, laying the framework for how data will be stored, shared and accessed by different departments, systems and applications, and aligning it to business strategy. Bob Lambert describes the skills as including an understanding of the system development life cycle; software project management approaches; data modeling, database design, and SQL development. The role is strategic, requiring an understanding of both existing and emerging technologies (NoSQL databases, analytics tools and visualization tools), and how those may support the organization’s objectives. The EDA’s role requires knowledge sufficient to direct the components of enterprise architecture, but not necessarily practical skills of implementation. With that said, Monster.com lists typical responsibilities as: determining database structural requirements, defining physical structure and functional capabilities, security, backup, and recovery specifications, as well as installing, maintaining and optimizing database performance.

Creating and Enforcing the Rules of Data Flow. A well-architected system requires order. A Data Governance Manager organizes and streamlines how data is collected, stored, shared, accessed, secured and put to use. But don’t think of the role as a traffic cop: the rules of the road are there not only to prevent ‘accidents’ but also to ensure efficiency and value. The governance manager’s responsibilities include enforcing compliance, setting policies and standards, managing the lifecycle of data assets, and ensuring that data is secure, organized and accessible to appropriate users, and only to them. By so doing, the data governance manager improves decision-making, eliminates redundancy, reduces the risk of fines and lawsuits, and secures proprietary and confidential information, so the organization achieves maximum value at minimum risk. The position implies at least a functional knowledge of databases and associated technologies, and a thorough knowledge of industry regulations (FINRA, HIPAA, etc.).

Making Use of the Data

We create a system in which data is well-organized and governed so that the business can make maximum use of it by informing day-to-day processes, and deriving insight from data analysts/scientists to improve efficiency or innovation.

Understand the past to guide future decisions. A Data Analyst performs statistical analysis and problem solving, taking organizational data and using it to facilitate better decisions on items ranging from product pricing to customer churn. This requires statistical skills, and critical thinking to draw supportable conclusions. An important part of the job is to make data palpable to the C-suite, so an effective analyst is also an effective communicator. MastersinScience.org refers to data analysts as “data scientists in training” and points out that the line between the roles is often blurred.

Modeling the Future. Data scientists combine advanced mathematical/statistical abilities with advanced programming abilities, including a knowledge of machine learning, and the ability to code in SQL, R, Python or Scala. A key differentiator is that where the Data Analyst primarily analyzes batch/historical data to detect past trends, the Data Scientist builds programs that predict future outcomes. Furthermore, data scientists are building machine learning models that continue to learn and refine their predictive ability as more data is collected.

Of course, as data increasingly becomes the currency of business, as it is predicted to, we expect to see more roles develop and the ones just described evolve significantly. In fact, we haven’t even discussed a role that the EU’s GDPR now requires of many organizations: the Data Protection Officer, or ‘DPO’.

Source: datasciencecentral.com


The Ultimate Data Set


Until recently, using entire populations as data sets was impossible—or at least impractical—given limitations on data collection processes and analytical capabilities. But that is changing.

The emerging field of computational social science takes advantage of the proliferation of data being collected to access extremely large data sets for study. The patterns and trends in individual and group behavior that emerge from these studies provide “first facts,” or universal information derived from comprehensive data rather than samples.

“Computational social science is an irritant that can create new scientific pearls of wisdom, changing how science is done,” says Brian Uzzi, a professor of management and organizations at the Kellogg School. In the past, scientists have relied primarily on lab research and observational research to establish causality and create descriptions of relationships. “People who do lab studies are preoccupied with knowing causality,” Uzzi says. “Computational work says, ‘I know that when you see X, you see Y, and the reason why that happens may be less important than knowing that virtually every time you see X, you also see Y.’”

“Big data goes hand in hand with computational work that allows you to derive those first facts,” Uzzi says. “Instead of trying to figure out how scientists come up with great ideas by looking at 1,000 scientists, you look at 12,000,000 scientists—potentially everyone on the planet. When you find a relationship there, you know it’s universal. That universality is the new fact on which science is being built.”


Computation in the Social Sphere

Studying large data sets for first facts about human behavior has led to striking advances in recent years. Uzzi notes how one particular data set—mobile-phone data—“has taught us very distinctively about human mobility and its implications for economical and social stratification in society.” It has also shed light on how people behave during evacuations and emergency situations, including infectious-disease outbreaks. Knowing how behaviors affect the spread of diseases can help public health officials design programs to limit contagion.

The ability to track the social behavior of large groups has also shifted people’s understanding of human agency. “Until recently, we really believed that each of us made our decisions on our own,” Uzzi says. “Our friends may have influenced us here or there but not in a big way.” But troves of social-media data have shown that people are incredibly sensitive and responsive to what other people do. “That’s often the thing that drives our behavior, rather than our own individual interests or desires or preferences.”

This may change how you think about your consumer behavior, your exercise regimen, or what you Tweet about. Researchers like Uzzi are also deeply interested in how this responsiveness influences political behavior on larger issues like global climate change or investments in education systems. Think of it as a shift from envisioning yourself as a ruggedly individual, purely rational, economic person to a sociological person who encounters and engages and decides in concert with others.

One aspect of computational social science—brain science—has already discovered that those decisions are often being made before we even know it. “Brain science has taught us a lot about how the brain reacts to stimuli,” Uzzi says. With the visual part of your brain moving at roughly 8,000 times the speed of the rest of your brain, the visual cortex has already begun processing information—and leaping to certain conclusions—before the rest of your brain ever catches up. And with 40 percent of the brain’s function devoted strictly to visualization, “if you want to get in front of anything that’s going to lead to a decision, an act of persuasion, an in-depth engagement with an idea, it has got to be visual.”

“The really big things are understanding how something diffuses through a population and how opinions change,” Uzzi says. “If you put those two things together, you really have an understanding of mass behavior.”

This increased understanding of both mass and individual behavior presents huge opportunities for businesses, notably in the health sphere. “There is going to be an entirely new ecology of business that goes beyond how we think about health today,” Uzzi says. “For many people, there is no upper threshold on what they will pay for good health and beauty. With health increasingly decentralized to the individual, that’s going to spin off to companies that want to take advantage of this information to help people do things better.”

Scaling from One to Everyone

While gathering data on groups as large as the entire population is beneficial to scientists, marketers, and the like, computational social science has the scalability to allow for practical data generation on an individual level as well. This means that you can be the subject of your own data-rich computational study, without control groups or comparison testing. “You actually generate enough data on yourself, every day, that could be collected, that you can be the subject of a computational study,” Uzzi says.

Developments in the ability to collect and parse data on individuals are one area where computational social science has the potential to transform people’s lives, from providing more information about individuals’ own health to raising their awareness of unconscious biases to showing how their decision-making processes are influenced by others. “It’s going to allow people to personally use data that can help them improve their lives in a way that they never imagined before,” Uzzi says.

For example, using wearable technologies allows for sensor data collection that can include emotional activation and heart-rate monitoring in social interactions, caloric intake, biorhythms, and nervous energy. The crunching of that raw data into actionable information will happen through our machines. If you think you have a close connection to your smartphone and your tablet now, wait until you rely on them to tell you how much that last workout helped (or did not help) you shake off the tension of a long day at the office.

“Our closest partnership in the world is probably going to be our machine that helps us manage all this,” Uzzi says. This can be transformative by making us healthier.

It may make us less discriminatory, too. We all have cognitive biases that lead us to make irrational decisions. These are thought to be hard-wired things we can identify but not necessarily change on our own. Sensor data can provide a feedback loop of how we have acted in the past. This has the potential to improve future decision making. If your sensors pick up signals that show your body acting differently around certain groups, perhaps in ways that you suppress or to which you are oblivious, that may be harder to ignore.

“Our own sense of identity could be greatly shaken by this, or improved, or both.”

Source: Kellogg Insight

The Big (Unstructured) Data Problem


The face of data breaches changed last year. The one that marked that change for me was the breach that involved former Secretary of State Colin Powell’s Gmail account. Targeted to expose the Hillary Clinton campaign, Colin Powell’s emails were posted on DCLeaks.com for everyone to read. One of them had an attachment listing Salesforce’s acquisition targets and the details of its M&A strategy. Colin Powell, a member of Salesforce’s board, had access, through his personal email account, to sensitive information. When his personal email was hacked, all of that sensitive information was exposed and blasted out in the headlines.

Corporations are trying to lock down sensitive information, most of it in structured systems and in data centers with a variety of security solutions. As it is getting harder for hackers to get to the data they want, they are finding the weakest path to that data and evolving their attack vector. Unstructured data is that new attack vector.

Most enterprises do not understand how much sensitive data they have, and when we consider how much unstructured data (emails, PDFs and other documents) a typical enterprise has under management, the red flags are clear and present. Analysts at Gartner estimate that upward of 80% of enterprise data today is unstructured. This is a big data problem, to say the least. As the level of unstructured data rises and hackers shift their focus to it, unstructured data is an issue that can no longer be placed on the enterprise IT back burner.

What Exactly Is Unstructured Data?

Unstructured data is any data that resides in emails, files, PDFs or documents. Sensitive unstructured data is usually data that was first created in a protected structured system, such as SAP Financials, and then exported into an Excel spreadsheet for easier consumption by audiences who are not SAP users.

Let me give you a very common example in any public company: Every quarter, a PR department receives the final quarterly financial numbers via email ahead of the earnings announcement in order to prepare a press release. The PR draft will be shared via email by a select group within the company before being approved and ready to be distributed out on the news wires. When pulling that financial information from the ERP system — a system that usually lives behind the corporate firewall with strong security and identity controls in place and with business owners who govern access to the systems and data within — we’ve instantly taken that formerly safe data and shared it freely by email as an Excel file.

A hacker could easily try to hack the credentials of a key employee rather than break into the network and tediously make his or her way to the ERP system. The path to getting the coveted earnings data can be easily shortened by focusing on its unstructured form shared via email or stored in files with limited security.

Right now, enterprises are woefully unprepared. Nearly 80% of enterprises have very little visibility into what’s happening across their unstructured data, let alone how to manage it. Enterprises are simply not ready to protect data in this form because they don’t understand just how much they have. Worse yet, they don’t even know what lies within those unstructured data files or who owns these files. Based on a recent survey created by my company, as many as 71% of enterprises are struggling with how to manage and protect unstructured data.

This is especially concerning when we consider the looming General Data Protection Regulation (GDPR) deadline. When that regulation takes effect in May 2018, any consumer data living in these unmanaged files that is exposed during a breach would immediately open the organization up to incredibly steep penalties. While regulations like GDPR put fear into companies, it may be a while before they start to take action. Many companies are struggling to strike the right balance between focusing on reacting to security threats versus time spent evaluating the broader picture of proactively managing risk for their company.

The Path Forward

Enterprises simply cannot afford to ignore the big unstructured data problem any longer. They need an actionable plan, one that starts with this four-step process:

• Find your unstructured data. Sensitive data is most likely spread out across both structured systems (e.g., your ERP application) and unstructured data (e.g., an Excel spreadsheet with exported data from your ERP app) that lives in a file share or in the numerous cloud storage systems companies use today for easier cross-company sharing and collaboration.
• Classify and assign an owner to that data. Not all data has value, but even some stale data may still be of a sensitive nature. Take the time to review all data and classify it to help you focus only on the most sensitive areas. Then assign owners to the classified unstructured data. If you do not know whom it belongs to, ask the many consumers of that data; they usually point in the same direction: its natural owner.
• Understand who has access to your data. It’s extremely important to understand who has access to all sensitive company information, so access controls need to be placed on both structured and unstructured data.
• Put parameters around your data. Sensitive data should be accessed on a “need to know” basis, meaning only a select few in the company should have regular access to your more sensitive files, the ones that could have serious consequences if they ended up in the wrong hands.

With these steps in place, you can better prevent anyone within your company from having access to a file that they don’t need to do their job, and ultimately minimize the risk of a breach. And although there are data access governance solutions that help corporations protect unstructured data, very few enterprises today have such a program in place. Ultimately, these solutions will need to find their way into enterprises as hackers once again change their attack vector to easier prey.
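The four steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the keyword patterns, the FileRecord structure, the file paths, and the need-to-know lists are all hypothetical, and a real program would rely on a dedicated data access governance tool rather than two regular expressions.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical patterns; real classification needs far richer rules.
SENSITIVE_PATTERNS = {
    "financial": re.compile(r"\b(revenue|earnings|acquisition)\b", re.IGNORECASE),
    "personal": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like number
}


@dataclass
class FileRecord:
    """One unstructured file found in a share (step 1: find)."""
    path: str
    text: str
    owner: Optional[str] = None          # step 2: assigned owner
    readers: Set[str] = field(default_factory=set)  # step 3: who has access


def classify(record: FileRecord) -> List[str]:
    """Step 2: return the sorted labels whose pattern matches the text."""
    return sorted(
        label for label, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(record.text)
    )


def excess_readers(record: FileRecord, need_to_know: Set[str]) -> Set[str]:
    """Step 4: readers with access who are not on the need-to-know list."""
    return record.readers - need_to_know


def audit(files: List[FileRecord],
          need_to_know: Dict[str, Set[str]]) -> List[Tuple[str, List[str], Optional[str], List[str]]]:
    """Report every sensitive file with its labels, owner and excess readers."""
    findings = []
    for record in files:
        labels = classify(record)
        if not labels:
            continue  # focus only on files classified as sensitive
        allowed = need_to_know.get(record.path, set())
        findings.append(
            (record.path, labels, record.owner, sorted(excess_readers(record, allowed)))
        )
    return findings
```

For instance, an exported earnings spreadsheet readable by the PR lead, the CFO and an intern would be reported with the intern as an excess reader, provided the need-to-know list for that path names only the first two.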

Source: Forbes

Using Cell Phone Data to Predict the Next Epidemic

Whom you call is linked to where you travel, which dictates how viruses spread.

Can big data about whom we call be used to predict how a viral epidemic will spread?

It seems unlikely. After all, viruses do not spread over a cell network; they need us to interact with people in person.

Yet, it turns out that the patterns in whom we call can be used to predict patterns in where we travel, according to new research from Kellogg’s Dashun Wang. This in turn can shed light on how an epidemic would spread.

Both phone calls and physical travel are highly influenced by geography. The further away a shopping mall or post office is from our home, after all, the less likely we are to visit it. Similarly, our friends who live in the neighborhood are a lot likelier to hear from us frequently than our extended family in Alberta.

But Wang and colleagues were able to take this a step further. By analyzing a huge amount of data on where people travel and whom they call, they were able to determine the mathematical formula that illustrates the link between how distance impacts these two very different activities. This understanding provides a framework for using data about long-distance interactions to predict physical ones—and vice versa.

As humans, we do not like to think that someone could anticipate our actions, says Wang, an associate professor of management and organizations. But his evidence says otherwise. “It’s just fascinating to see this kind of deep mathematical relationship in human behavior,” he says.

Wang’s conclusions were based on the analysis of three massive troves of cell phone data collected for billing purposes. The data, from three nations spanning two continents, included geographic information about where cell phone users traveled, as well as information about each phone call placed or received, and how far a user was from the person on the other end of the line.

The discovery of this underlying relationship between physical and nonphysical interactions has significant practical implications. For example, the researchers were able to model the spread of a hypothetical virus, which started in a few randomly selected people and then spread to others in the vicinity, using only the data about the flow of phone calls between various parties. Those predictions were remarkably similar to ones generated by actual information about where users traveled and thus where they would be likely to spread or contract a disease.
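To make the idea concrete, here is a toy sketch of how call volumes between regions could stand in for physical contact when modeling a hypothetical outbreak. It is not the researchers' model: the region-by-region call matrix, the transmission rate `beta`, and the simple deterministic update rule are all assumptions chosen for illustration.

```python
from typing import List


def normalize_rows(flows: List[List[float]]) -> List[List[float]]:
    """Turn raw call counts into per-region contact weights that sum to 1."""
    weights = []
    for row in flows:
        total = sum(row)
        weights.append([count / total if total else 0.0 for count in row])
    return weights


def simulate_spread(call_flows: List[List[float]],
                    infected: List[float],
                    beta: float = 0.3,
                    steps: int = 5) -> List[List[float]]:
    """Deterministic spread of the infected fraction in each region.

    At every step, region i is exposed to the infected fractions of the
    regions it exchanges calls with, weighted by its share of calls to each.
    Returns the full history, starting with the initial state.
    """
    weights = normalize_rows(call_flows)
    n = len(infected)
    state = list(infected)
    history = [state]
    for _ in range(steps):
        new_state = []
        for i in range(n):
            exposure = sum(weights[i][j] * state[j] for j in range(n))
            # Only the still-susceptible share (1 - state[i]) can be infected.
            new_state.append(min(1.0, state[i] + beta * exposure * (1.0 - state[i])))
        state = new_state
        history.append(state)
    return history


# Hypothetical call counts: regions 0 and 1 call each other often,
# while region 2 mostly calls within itself.
CALL_FLOWS = [
    [50, 40, 10],
    [40, 50, 10],
    [10, 10, 80],
]
history = simulate_spread(CALL_FLOWS, infected=[0.1, 0.0, 0.0])
```

In this toy setup the outbreak seeded in region 0 reaches the heavily connected region 1 faster than the weakly connected region 2, which is the qualitative pattern the call data is meant to capture.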

“I think that’s a great example to illustrate the opportunities brought about by big data,” Wang says. “The paper represents a major step in our quantitative understanding of how geography governs the way in which we are connected. These insights can be particularly relevant in a business world that is becoming increasingly interconnected.”

Source: Kellogg Insight

Why Big Data Will Revolutionize B2B Marketing Strategies

B2B, or business to business marketing, involves selling of a company’s services or products to another company. Consumer marketing and B2B marketing are really not that different. Basically, B2B uses the same principles to market its product but the execution is a little different. B2B buyers make their purchases solely based on price and profit potential while consumers make their purchases based on emotional triggers, status, popularity, and price. B2B is a large industry.

The fact that more than 50 percent of all economic activity in the United States is made up of purchases made by institutions, government agencies, and businesses gives you a perspective on the size of this industry. Technological advancements and the internet have given B2Bs new ways to make sense of their big data, learn about prospects, and improve their conversion rates. Innovations such as marketing automation platforms and marketing technology (sometimes referred to as ‘martech’) will revolutionize the way B2B companies market their products. They will be able to deliver mass personalization and nurture leads through the buyer’s journey.

In the next few years, these firms will be spending 73% more on marketing analytics. What does this mean for B2B marketing? The effects of new technology on B2B marketing will be more pronounced in some key areas. These are:

Lead Generation

In the old days, businesses had to spend fortunes on industry reports and market research to find out how and to whom to market their products. They had to build their marketing efforts on what their existing customer base seemed to like. However, growing access to technology and analytics has made revenue attribution and lead nurturing a predictable, measurable, and more structured process. While demand generation is an abstraction or a form of art (depending largely on whom you ask), lead generation is a repeatable scientific process. This means less guesswork and more revenue.

Small Businesses

Thanks to the SaaS (software-as-a-service) revolution, technologies once available only to elite firms (revenue reporting, real-time web analytics, and marketing automation) are now accessible and affordable to businesses of all sizes. Instead of attempting to build economies of scale, smaller businesses are using the power of these innovations to give the bigger competition a run for their dough. With SaaS, small businesses can now narrow their approaches and zero in on key accounts.

In the context of business to business marketing, this means that instead of trying to attract unqualified, uncommitted top-tier leads, these companies will go after matched stakeholders and accounts and earn their loyalty by providing exceptional customer experiences.

Data Analytics

A few years ago, data was the most underutilized asset in the hands of a marketer. That has since changed. Marketers are quickly realizing that big data is now more valuable than ever to their trade (for measuring results, targeting prospects, and improving campaigns) and are in search of more ways to exploit it. B2B marketing is laden with new tools that capitalize on data points. Firms use data scraping techniques and tools to customize their sites for their target audiences. Businesses can even use predictive lead scoring to gauge how leads are likely to perform in the future. Apache Kafka, for example, provides a distributed streaming platform for building real-time data pipelines as well as streaming apps.


It has always been hard for firms to calculate their return on marketing investment (ROMI), but the integration of marketing automation and CRM has made it easier for B2Bs to track and measure marketing campaign efforts through revenue marketing.
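As a rough illustration of the arithmetic, ROMI is commonly computed as the revenue attributed to marketing minus the marketing spend, divided by the spend. The sketch below assumes that simplified definition; the campaign names, deal amounts, and spend figures are hypothetical stand-ins for what a CRM and a marketing automation platform would supply.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def romi(attributed_revenue: float, marketing_spend: float) -> float:
    """Simplified ROMI: (attributed revenue - spend) / spend."""
    if marketing_spend <= 0:
        raise ValueError("marketing_spend must be positive")
    return (attributed_revenue - marketing_spend) / marketing_spend


def romi_by_campaign(deals: Iterable[Tuple[str, float]],
                     spend: Dict[str, float]) -> Dict[str, float]:
    """Sum closed-deal revenue per campaign, then compute ROMI for each.

    `deals` holds (campaign, amount) pairs, as a CRM export might;
    `spend` maps each campaign to its cost. Campaigns with no deals
    get a ROMI of -1.0, meaning the whole spend was lost.
    """
    revenue = defaultdict(float)
    for campaign, amount in deals:
        revenue[campaign] += amount
    return {campaign: romi(revenue[campaign], cost) for campaign, cost in spend.items()}


# Hypothetical figures for two campaigns.
deals = [("webinar", 30000.0), ("webinar", 20000.0), ("email", 5000.0)]
spend = {"webinar": 10000.0, "email": 10000.0}
results = romi_by_campaign(deals, spend)
```

Here the webinar campaign returns four times its cost while the email campaign loses half of its spend, the kind of per-campaign comparison that is only possible once deals in the CRM are tied back to the campaigns that produced them.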

Technological advancements have some exciting parallels in the B2B industry. In order to exploit this technology and gain a competitive edge, companies have to stay up to date. The risk involved is minimal, so these firms have nothing to worry about.

Source: Innovation Management

How To Build A Big Data Engineering Team


Companies are digitizing and pushing all their operational functions and workflows into IT systems that benefit from so-called ‘big data’ analytics. Using this approach, firms can start to analyze the massive firehose stream of data now being recorded by the Internet of Things (IoT) with its sensors and lasers designed to monitor physical equipment. They can also start to ingest and crunch through the data streams being produced in every corner of what is now a software-driven, data-driven business model.

All well and good, but who is going to do all this work? It looks like your company just had to establish a data engineering department.

Drinking from the data firehose
As a technologist, writer and speaker on software engineering, Aashu Virmani also holds the role of chief marketing officer at in-database analytics software company Fuzzy Logix (known for its DB Lytix product). Virmani claims that there’s gold in them thar data hills, if we know how to get at it. This is the point where firms start to realize that they need to invest in an ever larger army of data engineers and data scientists.

But who are these engineers and scientists? Are they engineers in the traditional sense with greasy spanners and overalls? Are they scientists in the traditional sense with bad hair and too many ballpoint pens in their jacket pockets? Not as such, obviously, because this is IT.

What’s the difference between a data scientist and a data engineer?
“First things first, let’s ensure we understand what the difference between a data scientist and a data engineer really is because, if we know this, then we know how best to direct them to drive value for the business. In the most simple of terms, data engineers worry about data infrastructure while data scientists are all about analysis,” explains Fuzzy Logix’s Virmani.

Boiling it down even more, one prototypes and the other deploys.

Is one role more important than the other? That’s a bit like asking whether a fork is more important than a knife. Both have their purposes and both can operate independently. But in truth, they really come into their own when used together.

What makes a good data scientist?
“They (the data scientist) may not have a ton of programming experience but their understanding of one or more analytics frameworks is essential. Put simply, they need to know which tool to use (and when) from the tool box available to them. Just as critically, they must be able to spot data quality issues because they understand how the algorithms work,” said Virmani.

He asserts that a large part of their role is hypothesis testing (confirming or denying a well-known thesis) but the data scientist that knows their stuff will impartially let the data tell its own story.

Virmani continued, “Visualizing the data is just as important as being a good statistician, so the effective data scientist will have knowledge of some visualization tools and frameworks to, again, help them tell a story with the data. Lastly, the best data scientists have a restless curiosity which compels them to try and fail in the process of knowledge discovery.”

What makes a good data engineer?
To be effective in this role, a data engineer needs to know the database technology. Cold. Teradata, IBM, Oracle, Hadoop are all ‘first base’ for the data engineer you want in your organization.

“In addition to knowing the database technology, the data engineer has an idea of the data schema and organization – how their company’s data is structured, so he or she can put together the right data sets from the right sources for the scientist to explore,” said Virmani.

The data engineer will be utterly comfortable with the ‘pre’ and ‘post’ tasks that surround the data science itself. The ‘pre’ tasks mostly deal with what we call ETL: Extract, Transform, Load.

Virmani continued, “Often it may be the case that the data science is happening not in the same platform, but an experimental copy of the database and often in a small subset of the data. It is also frequently the case that IT may own the operational database and may have strict rules on how/when the data can be accessed.  A data science team needs a ‘sandbox’ in which to play – either in the same DB environment, or in a new environment intended for data scientists. A data engineer makes that possible. Flawlessly.”
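As a rough illustration of those ‘pre’ tasks, the sketch below runs a tiny Extract, Transform, Load pass that copies a small subset of an operational table into a sandbox table. It uses Python’s built-in sqlite3 module with an in-memory database; the table names, columns, and sample rows are all hypothetical, and it is not meant to represent Fuzzy Logix’s tooling or any particular production setup.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, float, str]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the operational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions "
        "(id INTEGER PRIMARY KEY, amount_cents INTEGER, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, 1250, " us"), (2, 990000, "gb "), (3, 4300, "US"), (4, 120, "de")],
    )
    conn.commit()
    return conn


def extract(conn: sqlite3.Connection, limit: int) -> list:
    """Extract: pull a small subset of rows from the operational table."""
    return conn.execute(
        "SELECT id, amount_cents, country FROM transactions ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()


def transform(rows: list) -> List[Row]:
    """Transform: convert cents to currency units, normalize country codes."""
    return [(rid, cents / 100.0, country.strip().upper()) for rid, cents, country in rows]


def load(conn: sqlite3.Connection, rows: List[Row]) -> None:
    """Load: write the cleaned rows into a sandbox table for exploration."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sandbox_transactions "
        "(id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sandbox_transactions VALUES (?, ?, ?)", rows)
    conn.commit()


conn = build_demo_db()
load(conn, transform(extract(conn, limit=3)))
sandbox = conn.execute(
    "SELECT id, amount, country FROM sandbox_transactions ORDER BY id"
).fetchall()
```

Keeping the sandbox in a separate table means the data scientists can experiment freely without touching the operational rows, which is the arrangement the quote describes for the same-environment case.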

Turning to ‘post’ tasks, once the data science happens (say, a predictive model is built that determines ‘which credit card transactions are fraudulent’), the process needs to be ‘operationalized’.  This requires that the analytic model developed by the data scientists be moved from the ‘sandbox’ environment to the real production/operational database, or transaction system. The data engineer is the role that can take the output of the data scientist and help put this into production. Without this role, there will be tons of insights (some proven, some unproven) but nothing put into production to see if the model is providing the business value in real time or not.

Ok, so you now understand what ‘good’ looks like in terms of data scientists and engineers but how do you set them up for success?

How to make your big data team work
The first and most important factor here is creating the right operational structure to allow both parties to work collaboratively and to gain value from each other. Both roles function best when supported by the other so create the right internal processes to allow this to happen.

Fuzzy Logix’s Virmani warns that we should never let this become a tug of war between the CIO and the CDO (chief data officer), where the CDO’s organization just wants to get on with the analysis/exploration while the IT team wants to control access to every table/row there is (for what may be valid reasons).

“Next, invest in the right technologies to allow them to maximize their time and to focus in the right areas. For example, our approach at Fuzzy Logix is to embed analytics directly into applications and reporting tools so allowing data scientists to be freed up to work on high value problems,” he said.

Don’t nickel & dime on talent

In conversations with a number of firms in the big data space that are trying to establish big data teams, one final truth appears to resonate: don’t nickel and dime on talent. These roles are (comparatively) new, and if you pay for cheap labor then most likely you’re not going to get data engineering or data science gold.

Fuzzy Logix offers in-database and GPU-based analytics solutions built on libraries of over 600 mathematical, statistical, simulation, data mining, time series and financial models. The firm has an arguably non-corporate and relatively realistic take on real-world big data operations, and this conversation hopefully sheds some light on the internal mechanics of a department that a lot of firms are now working to establish.

Source: Forbes

Big Data Is Filling Gender Data Gaps—And Pushing Us Closer to Gender Equality


Imagine you are a government official in Nairobi, working to deploy resources to close educational achievement gaps throughout Kenya. You believe that the literacy rate varies widely in your country, but the available survey data for Kenya doesn’t include enough data about the country’s northern regions. You want to know where to direct programmatic resources, and you know you need detailed information to drive your decisions.

But you face a major challenge—the information does not exist.

Decision-makers want to use good data to inform policy and programs, but in many scenarios, quality, complete data is not available. And though this is true for large swaths of people around the world, this lack of information acutely impacts girls and women, who are often overlooked in data collection even when traditional surveys count their households. If we do not increase the availability and use of gender data, policymakers will not be able to make headway on national and global development agendas.

Gender data gaps are multiple and intersectional, and although some are closing, many persist despite the simultaneous explosion of new data sources emerging from new technologies. So, what if there were a way to use these new data sources to count those women and girls, and men and boys, who are left out by traditional surveys and other conventional data collection methods?

Big Data Meets Gender Data
“Big data” refers to large amounts of data collected passively from digital interactions, in great variety and at high velocity. Cell phone use, credit card transactions, and social media posts all generate big data, as does satellite imagery, which captures geospatial data.

In recent years, researchers have been examining the potential of big data to complement traditional data sources, but Data2X entered this space in 2014 because we observed that no one was investigating how big data could help increase the scope, scale, and quality of data about the lives of women and girls.

Data2X is a collaborative technical and advocacy platform that works with UN agencies, governments, civil society, academics, and the private sector to close gender data gaps, promote expanded and unbiased gender data collection, and use gender data to improve policies, strategies, and decision-making. We host partnerships which draw upon technical expertise, in-country knowledge, and advocacy insight to tackle and rectify gender data gaps. Across partnerships, this work necessitates experimental approaches.

And so, with this experimental approach in hand, and with support from our funders, the William and Flora Hewlett Foundation and the Bill & Melinda Gates Foundation, Data2X launched four research pilots to build the evidence base for big data’s possible contributions to filling gender data gaps.

Think back to the hypothetical government official in Kenya trying to determine literacy rates in northern Kenya. This time, a researcher tells her that it’s possible – that by using satellite imagery to identify correlations between geospatial elements and well-being outcomes, the researcher can map the literacy rate for women across the entire country.

This is precisely what Flowminder Foundation, one of the four partner organizations in Data2X’s pilot research, was able to do. Researchers harnessed satellite imagery to fill data gaps, finding correlations between geospatial elements (such as accessibility, elevation, or distance to roads) and social and health outcomes for girls and women as reported in traditional surveys (such as literacy, access to contraception, and child stunting rates). Flowminder then mapped these phenomena, displaying continuous landscapes of gender inequality that can provide policymakers with timely information on the regions with the greatest inequality of outcomes and the highest need for resources.
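The basic idea of linking a geospatial element to a surveyed outcome can be sketched in a few lines. This is a deliberately simplified illustration and not the method used in the pilot: the numbers are invented, a single covariate (distance to the nearest road) stands in for the many layers derived from satellite imagery, and a straight-line fit stands in for whatever statistical model the researchers actually used.

```python
from math import sqrt

# Invented survey clusters: (distance to nearest road in km, female literacy rate).
surveyed = [(1.0, 0.80), (5.0, 0.70), (10.0, 0.55), (20.0, 0.35), (30.0, 0.10)]
distances = [d for d, _ in surveyed]
literacy = [rate for _, rate in surveyed]


def pearson(xs, ys):
    """Pearson correlation between a covariate and a surveyed outcome."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def estimate(distance_km, slope, intercept):
    """Estimate the outcome for an unsurveyed location, clamped to [0, 1]."""
    return min(1.0, max(0.0, slope * distance_km + intercept))


correlation = pearson(distances, literacy)
slope, intercept = fit_line(distances, literacy)
unsurveyed_estimate = estimate(15.0, slope, intercept)
print(round(correlation, 3), round(unsurveyed_estimate, 2))
```

Repeating the estimate for every grid cell in a country is what turns a relationship learned from surveyed clusters into a continuous map, although any such estimate still needs to be checked against conventional data.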

This finding, and many others, are outlined in a new Data2X report, “Big Data and the Well-Being of Women and Girls,” which for the first time showcases how big data sources can fill gender data gaps and inform policy on girls’ and women’s lives. In addition to the individual pilot research findings outlined in the report, there are four high-level takeaways from this first phase of our work:

Country Context is Key: The report affirms that in developing and implementing approaches to filling gender gaps, country context is paramount – and demands flexible experimentation. In the satellite imagery project, researchers’ success with models varied by country: models for modern contraceptive use performed strongly in Tanzania and Nigeria, whereas models for girls’ stunting rates were inadequate for all but one pilot country.

To Be Useful, Data Must Be Actionable: Even with effective data collection tools in place, data must be demand-driven and actionable for policymakers and in-country partners. Collaborating with National Statistics Offices, policymakers must articulate what information they need to make decisions and deploy resources to resolve gender inequalities, as well as their capacity to act on highly detailed data.

One Size Doesn’t Fit All: In filling gender data gaps, there is no one-size-fits-all solution. Researchers may find that in one setting, a combination of official census data and datasets made available through mobile operators sufficiently fills data gaps and provides information which meets policymakers’ needs. In another context, satellite imagery may be most effective at highlighting under-captured dimensions of girls’ and women’s lives in under-surveyed or resource-poor areas.

Ground Truth: Big data cannot stand alone. Researchers must “ground truth,” using conventional data sources to ensure that digital data enhances, but does not replace, information gathered from household surveys or official census reviews. We can never rely solely on data sources which carry implicit biases towards women and girls who experience fewer barriers to using technology and higher rates of literacy, leaving out populations with fewer resources.

Big data offers great promise to complement information captured in conventional data sources and provide new insights into potentially overlooked populations. There is significant potential for future, inventive applications of these data sources, opening up opportunities for researchers and data practitioners to apply big data to pressing gender-focused challenges.

When actionable, context-specific, and used in tandem with existing data, big data can strengthen policymakers’ evidence base for action, fill gender data gaps, and advance efforts to improve outcomes for girls and women.

Source: cfr.org