Sphere of Influence Expands Data Analytics Studio

Sphere of Influence – a leader in value-added data science for high-volume, high-velocity and high-variety information assets – today announced continued investment in its McLean, VA operations, where it has doubled its data science team over the past year. The company – which recently expanded operations into Denver, CO – is also growing its digital solutions team.

The expansion of the Sphere of Influence data science studio coincides with the ramping-up of the company’s latest offering – analytics that predict customer experience for software systems.

“Our team of data strategists, data scientists and software developers has been creating exciting innovations that will make a real difference for businesses in competitive markets,” said Sphere of Influence Director of Accounts, Scott Pringle.  “Sphere of Influence has taken the steps to bring new data science solutions to our customers and expanded our science team to position the company for exciting new growth opportunities in 2016.”

About Sphere of Influence, Inc.

Sphere of Influence fuses advanced data science with digital solutions to deliver transformative products.  The company specializes in advanced data analytics for high-volume, high-velocity and high-variety information assets from a wide range of sensors in precision agriculture, automotive, and Internet of Things (IoT) telematics.  The company utilizes a broad and continuously growing integrated infrastructure of proprietary data science platforms, algorithms, and machine learning systems.  For additional information, please visit Sphere of Influence’s corporate website at:




Happy Data Privacy Day – Why Don’t I Feel Safer

Happy Data Privacy Day – Now stop the hysteria.

In honor of Data Privacy Day (Jan 28th), we must point out how the hysteria surrounding privacy has created an irrational fear that slows adoption of important technologies and actually hurts people as a result.

Privacy is a serious matter. We all know someone who has had their identity stolen. The financial loss, inconvenience, and personal violation cause identity theft to rank alongside health issues as one of the worst things that can happen to an individual. Our ever-expanding digital footprint creates a target-rich environment for criminals that exposes deeply personal matters of finance and individual privacy.

Sadly, many so-called privacy advocates are exploiting this fear to ensure their own relevance. They are using opportunities like “Data Privacy Day” to convince consumers to avoid “big data” and opt out of many modern conveniences. They juxtapose modern data-hungry digital services with identity theft, leaving the consumer afraid and confused. The advice is to “just say no” to all matters of digital consent, particularly if there is big data connected to a big company. They promote the idea that large corporations are looking to steal our assets and make huge profits from the details of our lives, leaving us exposed and compromised in the process.

Of course there are unscrupulous companies in today’s world that should not be entrusted with your personal information. There are many firms out there that have historically done a poor job of managing their consumer relationship. A company that positions profits before brand and consumer trust is hardly the model we strive for. Instead, as consumers, we should insist that these companies implement the kind of rigor that secures our personal data, maintains our privacy, contributes insight and provides consumers the means and option to change their mind.
So let’s not throw the baby out with the bathwater.

The advice and fear mongering promulgated by these Luddites provides no advantage in the digital age. Suggesting that we can and should opt out of modern digital services that aggregate data is like asking us to keep our life savings in our mattress. Should we store our flash drives under our mattress or in our freezer for safekeeping, where they will be safe from outside use? Those who do will suffer significant disadvantages compared to those who participate. Depreciating capital assets should be put to better use.

Analytics on Big Data opens doors and offers insights that were never before imagined. Computational analytics on large data sets changes the game in Agriculture, Health, Automotive, Energy, and almost every other sector. Large aggregated data sets allow science to discover the weakest of signals and amplify those signals in ways that produce predictive and informed insights. That is, unless consumers are frightened into thinking the risks outweigh the rewards.

Instead of alarming consumers about the dangers of participating, we should provide consumers with the facts about big data and the role privacy plays. Privacy advocacy groups would better serve their constituents by detailing the questions people should ask and the specific demands consumers should make of their personal data suitors.

We at Sphere of Influence have developed our Data Privacy Rules of Engagement. We believe these types of rules are a good approach for consumers who want to be sure that a request for their personal data will yield collective results without compromising their identity.

Sphere of Influence Data Privacy Rules of Engagement
1. First Do No Harm
2. Preserve the Public’s Right to Know
3. Preserve Consumer Right to be Forgotten
4. Preserve Consumer Right to be Remembered
5. Keep Relevancy Relevant

These competing rights and the privacy of the personal components of data can be reconciled through a robust application of process and technology. This in turn can keep private data private, while still allowing aggregated anonymized data to benefit consumers and society at large. These techniques combine anonymization, a multi-mount-point architecture, split repositories for private and service-accessible data, and a comprehensive service layer that only provides access to the data to which it has rights.
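
As a rough illustration of the split-repository idea, the sketch below keeps identifying fields and service-accessible fields in separate stores and puts a small service layer in front of them. Everything here is an assumption made for illustration: the class names (SplitRepository, ServiceLayer), the "service" and "private" rights, and the use of a keyed hash as a pseudonym are hypothetical, and a keyed hash is a weaker stand-in for real anonymization. It is not a description of any production system.

```python
import hashlib
import hmac


class SplitRepository:
    """Hypothetical store that keeps identifying and usage data apart."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._private = {}   # pseudonym -> identifying fields
        self._service = {}   # pseudonym -> non-identifying usage fields

    def _pseudonym(self, customer_id: str) -> str:
        # Keyed hash so the pseudonym cannot be recomputed without the key.
        return hmac.new(self._key, customer_id.encode(), hashlib.sha256).hexdigest()

    def store(self, customer_id, identifying, usage):
        pseudonym = self._pseudonym(customer_id)
        self._private[pseudonym] = dict(identifying)
        self._service[pseudonym] = dict(usage)
        return pseudonym

    def forget(self, customer_id):
        # "Right to be forgotten": remove the customer from both stores.
        pseudonym = self._pseudonym(customer_id)
        self._private.pop(pseudonym, None)
        self._service.pop(pseudonym, None)

    def read_service(self):
        return list(self._service.values())

    def read_private(self, pseudonym):
        return self._private[pseudonym]


class ServiceLayer:
    """Only hands out the data its caller has rights to."""

    def __init__(self, repo, rights):
        self._repo = repo
        self._rights = set(rights)

    def usage_records(self):
        if "service" not in self._rights:
            raise PermissionError("no rights to service data")
        return self._repo.read_service()

    def identity(self, pseudonym):
        if "private" not in self._rights:
            raise PermissionError("no rights to private data")
        return self._repo.read_private(pseudonym)


repo = SplitRepository(b"demo-key")
pseudonym = repo.store("cust-1", {"name": "Pat"}, {"miles": 120})
analytics = ServiceLayer(repo, {"service"})
print(analytics.usage_records())
```

An analytics caller holding only the "service" right can aggregate usage records but gets a PermissionError when it asks for an identity, and forget() removes a customer from both stores.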


Big data is streaming off our vehicles, portable devices and consumer electronics, portraying an important and valuable digital imprint of ourselves. There is no putting the genie back in the bottle. Rather than use hysteria to make consumers run for the hills, we should accept the reality of today’s digital world, embrace the opportunity for advancement in science and insist on a comprehensive approach to data privacy from the companies that use that data.

Chris Burns, Director
Sphere of Influence Software Studios
-A Premium Analytics Company

Dealing with your crappy data

Let’s confront a real problem in Big Data. Inside your data warehouse lurk errors that could potentially render your data as useless as a thousand disks packed with random numbers. With all the hype in the industry around storage, transfer, data access, point-and-click analysis software, etc., more emphasis should be placed on detecting, evaluating, and mitigating data errors.

If you are thinking, “my data contains no errors”, you are living in denial. Unless you are storing output produced by a closed-form mathematical expression or a random number generator, you should recognize that any sensor or data collector — whether an industrial process temperature sensor, a bar code reader, a social media sentiment analyzer, or a human recording a customer transaction — generates some error along with the true value.

In its basic form, the data residing in your data warehouse, the “measured value”, is a combination of the true value plus some unknown error component:


Measured Value = True Value + Unknown Error


This expression is valid for continuous variables where differences can be quantified (like position, account balance, number of miles between customer and retail store, etc.). It is also valid for categorical variables: discrete values that can be ordered (like level of income or age category) or unordered (like product category or name of sales associate).

Now, if we knew the error for each measured value, we could simply subtract it and restore the true value. But, unfortunately, that is usually not possible (otherwise, we would just store the true value right away), so we have to characterize the error as an uncertainty:


True Value = Measured Value ± Uncertainty


Clearly, if the error is significant, it can hide the true value, resulting in useless and potentially misleading, deeply flawed analytics. One of the fundamental responsibilities of a data scientist is to characterize the uncertainty in the data due to the error, and know (or estimate) when the error could contribute to faulty analysis, incorrect conclusions, and ultimately bad decisions.

Leeds General Infirmary in England shut down its children’s heart surgery ward for eleven days in March 2013 after errors and omissions in patient data led the hospital’s directors to incorrectly conclude that the child mortality rate was almost twice the national average. We assess that had these data errors been properly detected and mitigated, the results of the analysis would have been accurate, the decision to shut down the ward would not have been made, and more patients could have been treated. Source:


So, what are the types of errors in my data?

Data errors occur in all shapes and sizes, and uncertainty analysis should consider both the type of error and its relative magnitude. A multitude of error mitigation methods are available, but each is only effective for specific types of error, and could even amplify the errors in the data if applied carelessly. Let’s decompose the unknown error into random errors and systematic errors, because the ways of handling them are very different:

Measured Value = True Value + Random Error + Systematic Error


Noise — the Random Errors

Suppose a sales associate repeatedly types in customers’ names, phone numbers, and addresses, and every once in a while makes a typo – a replacement, deletion, or insertion error. But any previous typo has no influence on the next typo; these typos are assumed to be independent of each other. Or suppose you are recording temperature at various times and places in an industrial plant, where each sensor reports a variation about an average, and the average is an accurate representation of the true temperature for a time and location. In both examples, the errors have no sequential pattern and are entirely random, or noisy.

The nice thing about random errors is that they are readily handled by general-purpose statistical methods that you can find in mainstream statistical software packages. Noise can be estimated and smoothed out with simple averages and standard deviations, or with more advanced filtering or prediction techniques, such as Kalman Filters or Spectral Analysis (such as those based on Fourier or Wavelet Transforms). Other noise sources, such as the random typos above, can be readily fixed using statistical parsers or spelling or grammar checking techniques.
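
A minimal sketch of the simplest case, assuming zero-mean Gaussian noise around a constant true temperature: a trailing moving average shrinks the spread of the readings, and the overall mean lands close to the true value. The temperature, noise level, and window size are made-up numbers, and simulate_readings and moving_average are illustrative helpers rather than functions from any particular package.

```python
import random
import statistics


def simulate_readings(true_value, noise_sd, n, seed=0):
    """Return n readings of true_value corrupted by zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]


def moving_average(values, window):
    """Smooth a sequence with a simple trailing moving average."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


readings = simulate_readings(true_value=72.0, noise_sd=2.0, n=500)
smoothed = moving_average(readings, window=25)

raw_spread = statistics.stdev(readings)
# Skip the warm-up region where the window is not yet full.
smoothed_spread = statistics.stdev(smoothed[25:])

print(f"mean of readings: {statistics.mean(readings):.2f}")
print(f"raw spread: {raw_spread:.2f}, smoothed spread: {smoothed_spread:.2f}")
```

A Kalman filter or spectral method would replace moving_average when the true value itself changes over time; the averaging idea is the same.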


Bias — the Systematic Errors

Just like a wristwatch that is running ahead by three minutes, systematic errors maintain consistent patterns from measurement to measurement. An important distinction between random errors and systematic errors is that while random errors can be handled statistically, systematic errors cannot (Taylor, 1997).
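
The wristwatch example can be made concrete with a short simulation. In this sketch, which assumes a fixed three-minute offset plus a small amount of Gaussian reading noise (both values are illustrative), averaging ten thousand readings removes nearly all of the noise but leaves the three-minute bias untouched.

```python
import random
import statistics

rng = random.Random(1)

TRUE_OFFSET_MIN = 0.0   # a perfect watch would show zero error
BIAS_MIN = 3.0          # the watch runs three minutes ahead
NOISE_SD_MIN = 0.5      # assumed random reading error

readings = [TRUE_OFFSET_MIN + BIAS_MIN + rng.gauss(0.0, NOISE_SD_MIN)
            for _ in range(10_000)]

mean_reading = statistics.mean(readings)
print(f"mean of 10,000 readings: {mean_reading:.3f} minutes")
# The mean settles near 3.0, not 0.0: more data does not reveal the bias.
```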

Unfortunately, biases – fixed or drifting offset errors – tend to work their way into most measurements from all types of data collectors, from uncalibrated sensors to inadequately sampled survey populations. In fact, the only real way to detect or eliminate bias is to compare with some form of truth reference. For example, one of our customers presented us with data sets of physical measurements from heavy land machinery, where each machine reported large quantities of data with different biases. We quantified bias characteristics by comparing the machinery measurements against a validated baseline for a limited subset of the data. Then, by applying relaxation algorithms, we were able to minimize the bias errors relative to the machines.
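
The sketch below shows the basic idea of comparing against a truth reference, under simplifying assumptions: each machine has a single fixed offset, and a validated baseline exists for only the first 50 measurements. The machine names, bias values, and the estimate_bias helper are hypothetical, and this plain mean-residual correction is a much simpler stand-in for the relaxation algorithms mentioned above.

```python
import random
import statistics


def estimate_bias(measured, baseline):
    """Mean residual between measurements and a trusted baseline."""
    return statistics.mean(m - b for m, b in zip(measured, baseline))


rng = random.Random(2)
true_values = [rng.uniform(10.0, 50.0) for _ in range(1000)]
machine_biases = {"machine_a": 1.5, "machine_b": -2.0}  # hypothetical offsets

remaining_error = {}
for name, bias in machine_biases.items():
    measured = [t + bias + rng.gauss(0.0, 0.3) for t in true_values]
    # Only the first 50 points have a validated baseline to compare against.
    estimated = estimate_bias(measured[:50], true_values[:50])
    corrected = [m - estimated for m in measured]
    remaining_error[name] = statistics.mean(
        c - t for c, t in zip(corrected, true_values)
    )
    print(f"{name}: estimated bias {estimated:+.2f}, "
          f"remaining mean error {remaining_error[name]:+.3f}")
```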

Suppose we survey a consumer group about product preferences from urban communities in one large metropolitan area. Even with very large sample sizes, are these results meaningful for rural populations? Are they representative of other cities throughout the US? If we gathered even more responses from the same city, would our data be a better approximation for the US? The answer is most likely no in each case, because the data is biased toward the preferences of the sampled population.
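
A small simulation illustrates why a larger sample does not help here. The preference rates and the urban share of the population below are invented numbers chosen only to make the effect visible, and survey_urban_only is an illustrative helper.

```python
import random

URBAN_PREFERENCE = 0.70   # hypothetical share of urban consumers who prefer the product
RURAL_PREFERENCE = 0.40   # hypothetical share of rural consumers who prefer it
URBAN_SHARE = 0.30        # hypothetical fraction of the population that is urban


def survey_urban_only(n, seed):
    """Fraction of n urban respondents who say they prefer the product."""
    rng = random.Random(seed)
    return sum(rng.random() < URBAN_PREFERENCE for _ in range(n)) / n


national_rate = (URBAN_SHARE * URBAN_PREFERENCE
                 + (1 - URBAN_SHARE) * RURAL_PREFERENCE)

small = survey_urban_only(1_000, seed=5)
large = survey_urban_only(100_000, seed=5)

print(f"national rate: {national_rate:.2f}")
print(f"urban-only estimate, n=1,000:   {small:.3f}")
print(f"urban-only estimate, n=100,000: {large:.3f}")
# Both estimates converge on the urban rate, not the national one.
```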

Another representative example of a sampling bias that we see regularly is based on the Nyquist criterion. Suppose we want to compute the slopes along a particular road from GPS elevation measurements recorded at regular quarter-mile intervals. The obvious problem with this approach is that any slope between two hilltops separated by less than half a mile will be aliased – the slopes will appear much smaller than they are in reality. (This is the same type of problem that causes the wagon wheels to appear to rotate backwards in old western movies.)
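
To see the effect numerically, the sketch below models the road as a sinusoidal elevation profile with hilltops 0.3 miles apart, which is closer together than the half-mile limit for quarter-mile sampling. The amplitude, hill spacing, and road length are assumptions chosen for illustration.

```python
import math

AMPLITUDE_FT = 50.0        # assumed hill height above the mean elevation
HILL_SPACING_MI = 0.3      # assumed distance between hilltops
SAMPLE_INTERVAL_MI = 0.25  # quarter-mile GPS samples


def elevation(x_mi):
    """Idealized sinusoidal road profile in feet."""
    return AMPLITUDE_FT * math.sin(2 * math.pi * x_mi / HILL_SPACING_MI)


def max_abs_slope(interval_mi, road_length_mi=6.0):
    """Steepest slope (ft per mile) seen between consecutive samples."""
    n = int(round(road_length_mi / interval_mi))
    xs = [i * interval_mi for i in range(n + 1)]
    zs = [elevation(x) for x in xs]
    return max(abs(z1 - z0) / interval_mi for z0, z1 in zip(zs, zs[1:]))


coarse = max_abs_slope(SAMPLE_INTERVAL_MI)
fine = max_abs_slope(0.005)
true_max = AMPLITUDE_FT * 2 * math.pi / HILL_SPACING_MI

print(f"true steepest slope:          {true_max:7.1f} ft/mile")
print(f"from dense sampling:          {fine:7.1f} ft/mile")
print(f"from quarter-mile sampling:   {coarse:7.1f} ft/mile")
```

With these numbers the quarter-mile samples report a steepest slope that is a small fraction of the real one, while dense sampling recovers it almost exactly.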

We regularly see such biases working themselves undetected into analytics that could lead to bad decisions. In our experience, detecting and mitigating bias is much more challenging than dealing with random error, because it requires an intimate knowledge of the domain and because standard statistical methods are not generally applicable.


How to make your crappy data useful

Now that we have described how error can dramatically reduce the utility of your data, what should one do to mitigate its bad influence on analytics?

First, know your data and quantify its uncertainty. Understand the conditions and environment under which the data is collected, then for a representative part of your data, find a trustworthy baseline to compare against. Using the baseline to “reverse-engineer” the errors, quantify the random and the systematic errors separately, realizing that the mitigation techniques will be quite different. Describe the spread of the random error with the common measures, such as the standard deviation; describe the systematic fixed or varying offset errors as means and slopes over sequential segments in the data. It is especially important to characterize the uncertainty when merging in new data sources to ensure that the new data doesn’t amplify the errors significantly.
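
One way this characterization might look in practice is sketched below. It assumes residuals (measured value minus baseline) are already available in sequence order, fits a straight line to each segment to describe the systematic part as a mean offset and a drift, and reports the standard deviation of what is left as the random part. The fit_line and characterize helpers, the segment length, and the synthetic residuals are all illustrative choices.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def characterize(residuals, segment_len):
    """Per-segment offset, drift, and random spread of baseline residuals."""
    report = []
    for start in range(0, len(residuals), segment_len):
        segment = residuals[start:start + segment_len]
        if len(segment) < 2:
            continue  # too short to fit a slope
        xs = list(range(len(segment)))
        slope, intercept = fit_line(xs, segment)
        detrended = [r - (intercept + slope * x) for x, r in zip(xs, segment)]
        report.append({
            "start": start,
            "mean_offset": statistics.mean(segment),
            "drift_per_sample": slope,
            "random_sd": statistics.pstdev(detrended),
        })
    return report


rng = random.Random(3)
# Synthetic residuals: a drift of 0.01 per sample plus noise with sd 0.5.
residuals = [0.01 * i + rng.gauss(0.0, 0.5) for i in range(400)]
report = characterize(residuals, segment_len=200)

for row in report:
    print(row)
```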

Second, understand how error can affect your analysis. We frequently use a simulation for a sensitivity analysis, starting with an error-free condition and progressively increasing the random and systematic errors until we detect a significant reduction in performance. Suppose we have a model that predicts whether an automobile service customer will return to the service bay, in part based on the distance the customer lives from the service bay. We can then insert different error conditions on the distance variable, and empirically determine when the model fails to predict customer behavior reliably.
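
A toy version of such a sensitivity analysis is sketched below. The "model" is a deliberately simple stand-in, a hypothetical rule that customers living within ten miles return, and the noise and bias levels are arbitrary. A real study would substitute the actual model and a realistic distribution of distances.

```python
import random


def predict_return(distance_mi, threshold_mi=10.0):
    """Toy stand-in model: customers living closer than the threshold return."""
    return distance_mi < threshold_mi


def accuracy_under_error(true_distances, noise_sd, bias, seed=0):
    """Share of predictions that survive when distance is corrupted."""
    rng = random.Random(seed)
    hits = 0
    for distance in true_distances:
        observed = distance + bias + rng.gauss(0.0, noise_sd)
        hits += predict_return(observed) == predict_return(distance)
    return hits / len(true_distances)


rng = random.Random(4)
true_distances = [rng.uniform(0.0, 30.0) for _ in range(5000)]

# Start error-free, then progressively increase random and systematic error.
for noise_sd in (0.0, 1.0, 3.0, 6.0):
    for bias in (0.0, 2.0, 5.0):
        acc = accuracy_under_error(true_distances, noise_sd, bias)
        print(f"noise sd {noise_sd:3.1f} mi, bias {bias:3.1f} mi -> accuracy {acc:.3f}")
```

Reading down the output shows the point at which accuracy drops below whatever level the business considers acceptable.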

Third, apply error mitigation as a preprocessing stage. From our experience, many analytic tools, such as classifiers, can perform better when the random error is smoothed out. Unmitigated biases can propagate inconsistent data features into downstream analytics, so it is useful to first determine the regions in the data that are potentially affected by bias. Assuming the high-bias regions can be identified, they can be excluded if it is not possible to mitigate the bias error. Detection and mitigation of bias must be specifically tailored to the type of data and the method of collection.

So, how crappy is your data?

Do you know if the errors are affecting your results, and providing potentially flawed “insights”? Are you tracking the noise or the signal? Is your data so corrupted by error that any advanced analytics lead to contradictory conclusions? If so, you may need to refocus your corporate data strategy on more enhanced error characterization and mitigation techniques.

Wake up! It’s the insider threat you need to worry about



Edward Snowden is the new face of the insider threat; the media even calls him the “Ultimate Insider Threat”. This is someone who has the highest-level security clearance, endures a background reinvestigation every 5 years, takes a polygraph exam, and still betrays his sacred oath and the trust of his employers.
When it comes to asserting workforce trustworthiness, industry and government are both guilty of over-relying on employment pre-screening, background investigations, and oaths.  These are effective to a degree and good first steps but obviously inadequate when it comes to preventing losses and breaches.

Insider threats are detectable because they don’t behave exactly like everyone else.  Maybe on the surface these people appear to be the same as their coworkers, but at some level their behaviors are different.  A sensitive enough instrument can detect such subtle differences in behavior, and if the noise of anomalies can be removed then high-quality actionable alerts can be generated from the “unusual anomalies”.  This is the basis of the Insider Threat Detection technology that has been developed by Sphere of Influence over the past two years.

The problem isn’t cyber-security, which is focused on the threat of digital attacks against digital assets. This is an industrial security threat, where a person of trust betrays that trust and misuses access to cause deep harm or substitute a third-party agenda. Unlike cyber-attacks, an effective insider might not even use your digital assets as the vehicle for attack or exfiltration; they might steal files from a safe or do other things. However, if a person’s normal behavioral modalities change even slightly, then shadows of those changes are often reflected in how they use the computer. Computer activity can thus yield a behavioral profile for an individual, even if the actual threatening behavior is more analog than digital.

By connecting a sensitive behavioral profiling instrument to a network we can construct individual profiles that are accurate enough to perform this type of anomaly detection. Such algorithm-synthesized profiles apply to human and non-human users of a network, giving some cyber-security crossover to this approach in addition to the industrial security focus. However, Insider Threat Detection is not cyber-security; it is industrial security that uses cyber-technology as a sensor.

In our case the goal of this technology is to detect the active insider threat early in the activity cycle. We believe strongly that there is no way to fully prevent insider threats from occurring because no background screening process on Earth will ever accomplish that. To defend against the insider we believe early detection of active threat behaviors is the key to loss prevention.
This is possible thanks to Advanced Data Analytics (Analytics 2.0) techniques which evaluate dozens (or even thousands) of simultaneous feature dimensions on Big Data under a powerful layer of unsupervised machine learning. What makes insider threat detection different from conventional Analytics 2.0 is that it must work on streaming data, in real time, and at scale.
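
To give a feel for per-user profiling on a stream, here is a heavily simplified sketch. It tracks a single made-up feature (say, files accessed per hour) for each user with Welford's online mean and variance, and flags an event that sits far outside that user's own history. The class names, the z-score threshold, and the single feature are assumptions for illustration only. This is not the Sphere of Influence instrument, which the text describes as working across many feature dimensions with unsupervised machine learning.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online algorithm for mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class StreamingProfiler:
    """Hypothetical per-user baseline that scores events as they arrive."""

    def __init__(self, z_threshold=4.0, min_events=30):
        self.profiles = defaultdict(RunningStats)
        self.z_threshold = z_threshold
        self.min_events = min_events

    def observe(self, user, value):
        """Score the event against the user's own history, then learn from it."""
        stats = self.profiles[user]
        alert = False
        if stats.n >= self.min_events and stats.std() > 0:
            z_score = abs(value - stats.mean) / stats.std()
            alert = z_score > self.z_threshold
        stats.update(value)
        return alert


profiler = StreamingProfiler()
# One hundred ordinary hours for a hypothetical user, then one unusual hour.
normal_alerts = [profiler.observe("alice", 18 + i % 5) for i in range(100)]
spike_alert = profiler.observe("alice", 400)
print(f"alerts during normal activity: {sum(normal_alerts)}, spike flagged: {spike_alert}")
```

Because each update touches only a few numbers per user, this kind of profile can be maintained on a stream without storing the raw history.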

At Sphere of Influence, because we have been so invested in Advanced Data Analytics these past few years, we were able to solve these problems and invented an instrument that does what I describe here.  We use it every day on our networks and it is already installed at beta customers, primarily law offices.

The bottom line is that even the most intense background checks are not good enough. You need to be able to detect insider threats when they become active and before those threats move to Hong Kong.