A Blockchain Case Study: Building a Derivative Marketplace with Ethereum – Part I

Blockchain is a revolutionary paradigm that will reshape every aspect of our lives. The first part of this series presents a high-level overview of the underpinnings of blockchain technology and develops some intuition for smart contracts. After finishing this post, readers will have a solid understanding of blockchain fundamentals, including the term smart contract. Part two of this series will explore the development of a centralized derivative marketplace for US equities denominated in Ether. Part three will present an alternative marketplace construction utilizing smart contracts deployed on the Ethereum network.

Before we jump into an explanation of blockchain, let’s begin with a bit of history. The blockchain revolution began in 2009 when Satoshi Nakamoto published the reference implementation for Bitcoin. Interestingly, the identity of Satoshi Nakamoto remains unknown. Nakamoto’s work synthesized the concepts of distributed ledgers, public key cryptography, and so-called proof of work, thereby making a digital currency possible. Ethereum, proposed in 2013 and hailed as Bitcoin 2.0, built a general-purpose programming language into a blockchain framework. This combination allowed for the construction of so-called smart contracts.

So how does a blockchain work? Assume we have a collection of networked computers running a blockchain client (a piece of software). Transactions are broadcast to all participants on the network, and cryptographic techniques ensure that a transaction from account A to account B was transmitted by the owner of account A. That is, I can send my friend Tony 10 coins, but Tony is not able to generate a transaction of 10 coins from my account. These concepts, when combined with the notion of proof of work, establish the necessary components for a decentralized, cryptographically secured transaction network that incentivizes participants to process transactions. Proof of work, or mining, is a complex process, but the idea is relatively simple. Every computer on the network receives a copy of each transaction. The computers play a game where they bundle transactions together and attempt to solve a challenging cryptographic problem over that bundle. If a computer finds the answer to this problem, it may add a fixed reward of newly minted coins to its account and broadcast its block to the network. Participants on the network verify that the block contains a valid solution (the proof of work) and then add the block to their chain; hence, blockchain.
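The mining game described above can be sketched in a few lines of Python. This is a toy illustration, not the actual Bitcoin or Ethereum protocol: the block format, difficulty target, and reward are simplified assumptions.

```python
import hashlib
import json

def mine_block(transactions, prev_hash, difficulty=4):
    """Search for a nonce whose block hash starts with `difficulty` zero hex
    digits. A toy proof of work -- real networks use far harder targets."""
    target = "0" * difficulty
    nonce = 0
    while True:
        payload = json.dumps({"txs": transactions, "prev": prev_hash,
                              "nonce": nonce}, sort_keys=True).encode()
        block_hash = hashlib.sha256(payload).hexdigest()
        if block_hash.startswith(target):
            return nonce, block_hash  # valid proof of work found
        nonce += 1

# Finding the nonce takes many hash attempts; any participant can verify
# the solution with a single hash computation.
nonce, h = mine_block([{"from": "A", "to": "B", "amount": 10}], prev_hash="00abc")
```

The asymmetry on display here, expensive to solve, cheap to verify, is what lets every node on the network independently check a broadcast block.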

Every block contains a certain number of transactions. By beginning at the first block, the genesis state, and successively applying the transactions in each block, it is possible to regenerate the state of the blockchain at any point in time. A blockchain is analogous to a bank statement where your account balance increases and decreases over time as new transactions are processed. However, unlike a traditional bank statement, every participant on the network has access to your transaction history. This exposure is limited because accounts are pseudonymous: individuals cannot readily identify you from your account number alone. Furthermore, every computer on the network retains a copy of this chain, making it nearly impossible for a malicious party to alter the chain in any way.
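Replaying blocks to regenerate state can be sketched as follows. This is a toy model that ignores signatures, fees, and proof-of-work verification:

```python
def replay(genesis_balances, blocks):
    """Rebuild account balances by applying every transaction in order,
    starting from the genesis state."""
    balances = dict(genesis_balances)
    for block in blocks:
        for tx in block:
            balances[tx["from"]] -= tx["amount"]
            balances[tx["to"]] = balances.get(tx["to"], 0) + tx["amount"]
    return balances

# One block containing a single 10-coin payment from alice to tony.
state = replay({"alice": 50}, [[{"from": "alice", "to": "tony", "amount": 10}]])
# state == {"alice": 40, "tony": 10}
```

Because every node holds the same blocks, every node arrives at the same balances, which is precisely what makes the ledger a shared source of truth.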

Initial blockchain ledgers such as Bitcoin only allowed transactions and small messages to be stored on the distributed ledger. The next innovation in blockchain technology came to be known as smart contracts. Smart contracts are blocks of code stored on the blockchain that contain instructions and, possibly, account balance(s). Rather than provide an abstract definition of a smart contract, let us see an example. Suppose you wish to place a bet on a sporting event with a person you do not fully trust. Rather than hope he or she makes good on their end of the bet, both parties could deposit, say, 10 coins into a smart contract on the blockchain network. This smart contract can serve as an escrow account and be programmed to send the winnings to the appropriate party following the game. For example, the contract code could query the result of the game from a predefined source and transfer the funds to the appropriate party at a future time. Hence, smart contracts can replace a trusted intermediary. This makes transacting value across the internet secure, robust, flexible, and cost-effective. In part two, we will explore the construction of a centralized derivative marketplace for US equities.
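The sports-bet escrow above can be sketched as a toy contract object. This is illustrative Python, not real Ethereum contract code; in practice the winner would be supplied by the predefined result source, and the party names are invented.

```python
class BetEscrow:
    """Toy escrow: both parties deposit a stake, and the result source
    decides who receives the combined pot."""
    def __init__(self, party_a, party_b, stake):
        self.stake = stake
        self.deposits = {party_a: 0, party_b: 0}

    def deposit(self, party, amount):
        assert party in self.deposits and amount == self.stake
        self.deposits[party] = amount

    def settle(self, winner):
        # In a real smart contract the winner would come from the predefined
        # data source, not from an arbitrary caller.
        assert all(v == self.stake for v in self.deposits.values()), "both must fund"
        return {winner: 2 * self.stake}

bet = BetEscrow("alice", "bob", stake=10)
bet.deposit("alice", 10)
bet.deposit("bob", 10)
payout = bet.settle("alice")  # the whole 20-coin pot goes to the winner
```

The key property is that once both deposits are made, neither party can withdraw or redirect the funds; only the programmed settlement rule can move them.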

A Blockchain Case Study: Building a Derivative Marketplace with Ethereum – Part II

Part one of this blog series provided an overview of blockchain fundamentals and introduced smart contracts. Readers unfamiliar with this material are encouraged to first read part one. This post will explore construction of a centralized marketplace for binary derivatives. This approach does not make effective use of the blockchain, but is presented to illustrate the numerous challenges that must be overcome when taking a centralized approach. In part three, an elegant alternative is presented which employs smart contracts deployed on the Ethereum network to facilitate transactions between parties. For the remainder of this series, blockchain will refer to the Ethereum blockchain and the underlying cryptocurrency is Ether (plural: Ether).

A centralized exchange depends on the very requirement blockchains were designed to eliminate: trust. Upon account creation, customers must deposit Ether into their account(s). In other words, they must give the exchange their Ether. Customers must have confidence that the exchange will safeguard their coins and settle contracts properly. If this trust is compromised, customers will not use our exchange.

When customers deposit Ether into our exchange, they are sending a transaction from their personal accounts to the exchange’s account. Indeed, the exchange can be represented by a single account on the Ethereum blockchain. The accounts that exist within the exchange exist only on the computers at the exchange and are not included on the blockchain ledger. As a result, customers have had little recourse when counterfeit exchanges accept deposits and vanish. The establishment of trust is a crucial component of the centralized approach. Large exchanges such as Bittrex, Kraken, and GDAX have spent millions of dollars on both infrastructure and marketing to develop credibility.

Once we have spent millions of dollars on infrastructure and developed widespread credibility across the internet, our exchange is ready for business (I never said this approach was easy). Our exchange will trade binary derivatives on US equities. A binary derivative can be viewed as a bet on the future price of a stock. Suppose the price per share of Google Inc. is $950.00 and I make the following bet with a friend: you’ll pay me 10 Ether if the stock price of Google Inc. closes above $950.005 next Friday; otherwise, I’ll pay you 10 Ether. The outcome is binary in that the price must close above or below $950.005 (it cannot close at $950.005). In finance, the term derivative is reserved for instruments that derive their value from the price of another asset. To understand why this contract derives its price from the stock, suppose we made the same bet, but the price of Google was $1,000.00 per share. Given the share price is well above our agreed-upon (strike) price of $950.005, there is a pretty good chance that I’m going to win this bet. My friend realizes this and proposes the following modification to our agreement: I’ll pay you 10 Ether if the stock price of Google Inc. closes below $950.005 next Friday; otherwise, you’ll pay me 1 Ether. Essentially, my friend has decreased the amount he is willing to pay because his risk of losing his investment has increased. If I win the bet, I receive back my 10 Ether plus his 1 Ether. If the price of Google Inc. dives below $950.005, my friend receives back his investment plus my 10 Ether.

Our exchange will enable thousands of individuals to place bets on the future price of various US equities. What makes derivative instruments fascinating is that anyone can buy or sell contracts. If a contract does not exist with your desired specifications, you can create it! For example, a customer could create a contract at our exchange that offers to pay 5 Ether if Google Inc. climbs above $980.005 by the end of the month; otherwise, they wish to be paid 1 Ether. If another customer agrees to these terms, our exchange will facilitate the transaction between the two customers (perhaps for a small fee).

In part three we will modify the above approach to remove the requisite of trust from our exchange. We will see how smart contracts facilitate a decentralized market where two parties can enter a contractual agreement and be assured the other party will make good on their contractual obligations.

A Blockchain Case Study: Building a Derivative Marketplace with Ethereum – Part III

At last, we possess the knowledge to build our decentralized marketplace! If you’ve come this far, you should have a solid grasp of how a blockchain operates and a conceptual understanding of smart contracts. In this final post, we synthesize these concepts and develop a decentralized marketplace that overcomes many of the challenges posed by a centralized exchange. Parts one and two can be accessed here and here, respectively.

Up to this point, a formal definition of smart contract has been avoided because providing a definition without context is oftentimes more confusing than helpful. For the duration of this post we will make use of the following definition. A smart contract is an object that exists on the blockchain which contains state variables and functions that are triggered by messages. If that doesn’t quite make sense yet, don’t worry. Our binary derivative will be represented by a smart contract with the following state variables.

  • Address and wager of writer – Account number of the individual creating the contract and the amount the writer is wagering.
  • Address and wager of buyer – These both default to none. When a buyer accepts the contract terms, their account number and wager are updated.
  • Ticker symbol – Indicates the stock being considered (i.e. GOOG).
  • Strike price – Threshold price for the contract. This should be specified using three decimals to ensure the price does not close on the strike price.
  • Stance – Bull indicates the writer of the contract believes the stock price will close above the strike price; Bear indicates the writer believes it will close below.
  • Required leverage – This will be discussed below.
  • Expiration date – Contract code will execute shortly after 4:00 PM on this date.

Our smart contract serves as an escrow account. When two parties agree to a wager, they deposit funds into this contract. Both parties do not need to wager the same amount because the odds of winning may favor one party over the other. To make the above components more concrete, let’s look at an example with the following state variables.

  • Address and wager of writer – d26114cd6EE289AccF82350c8d8487fedB8A0C07 / 0.6 Ether
  • Address and wager of buyer – None / None
  • Ticker symbol – GOOG
  • Strike price – $950.005
  • Stance – Bear
  • Required leverage – 5
  • Expiration date – next Friday

Let’s assume the price of Google Inc. is $975.00 per share. My Ethereum account address is d26114cd6EE289AccF82350c8d8487fedB8A0C07 (please send Ether :D) and I am betting 0.6 Ether that the price of Google Inc. closes below $950.005 next Friday. A required leverage of 5 indicates that the buyer of my contract must wager at least (5 x 0.6) 3 Ether. This isn’t as contrived as it first appears. The price of Google is currently above the strike price, so the odds that the price falls below $950.005 are not in my favor. Thus, to compensate for this risk, I require a 5-fold return on my investment.
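Putting the state variables and the leverage rule together, here is a minimal sketch in Python. The field names and settlement logic are assumptions drawn from the description above, not actual Ethereum contract code, and the buyer address is invented.

```python
from dataclasses import dataclass

@dataclass
class BinaryDerivative:
    writer: str            # account address of the contract's writer
    writer_wager: float    # in Ether
    ticker: str
    strike: float          # three decimals so the close cannot land on it
    stance: str            # "bear": writer bets the close is below the strike
    required_leverage: int
    expiration: str
    buyer: str = None      # defaults to none until a buyer accepts
    buyer_wager: float = None

    def min_buyer_wager(self):
        """Smallest counter-wager the writer will accept."""
        return self.required_leverage * self.writer_wager

    def settle(self, close_price):
        """Return the address that receives the combined pot."""
        if self.stance == "bear":
            writer_wins = close_price < self.strike
        else:  # "bull": the writer bets the close is above the strike
            writer_wins = close_price > self.strike
        return self.writer if writer_wins else self.buyer

c = BinaryDerivative(writer="d26114cd6EE289AccF82350c8d8487fedB8A0C07",
                     writer_wager=0.6, ticker="GOOG", strike=950.005,
                     stance="bear", required_leverage=5, expiration="next Friday")
c.buyer, c.buyer_wager = "buyer-address", c.min_buyer_wager()  # 3 Ether
winner = c.settle(close_price=975.00)  # close above the strike: buyer wins
```

Note how the leverage rule expresses the odds: the buyer risks 3 Ether against the writer's 0.6, compensating the writer for taking the unfavorable side of the bet.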

When this smart contract is deployed on the Ethereum network, it is assigned an account number. I could send this account number to an individual, and he or she would be able to place the counter-wager of 3 Ether. Next Friday, I message the close price of Google to the contract, and the contract code transfers the funds to the appropriate account. This process removes the necessity of a trusted, centralized party, but it suffers from the following problems.

  • The buyer of the contract could be tricked into sending money to a counterfeit contract or, potentially, someone else’s personal account.
  • The writer of the contract could broadcast an incorrect close price for the stock to the smart contract to cheat the system.

The above issues can be remedied by creating a centralized marketplace where users trade decentralized smart contracts. Note that a centralized marketplace does not suffer from the trust issues inherent in our centralized exchange, because customer funds are held by the smart contracts on the blockchain, never by the marketplace itself. The site will allow users to post the addresses of their smart contracts to locate potential buyers. Prior to publishing the address of a contract on the site, our servers will query the Ethereum blockchain to ensure the supplied address is associated with a valid contract. Finally, to ensure the writer or buyer is unable to cheat the contract by transmitting an invalid stock price, the published contract must be coded to accept messages only from our server address. By constructing a centralized marketplace atop a distributed ledger, we can offer our customers an excellent user experience while retaining the unparalleled trust and credibility of a blockchain.

To conclude, blockchain technology is still in its infancy and its impacts and applications are far from certain. However, it is certain that blockchain will reshape our lives in the coming years. Rarely has a technology promised to be so disruptive. Oftentimes large companies retain market share for the sole reason that they are trusted by customers. By replacing long-established trust with a technological paradigm, customers will be able to trust companies and individuals in ways that have never been possible. We are entering a new era of commerce. Every company is at risk of being replaced by a trusted version of itself [1].

[1] Ritche Etwaru – May 17, 2017 at TEDxMorristown

Happy Data Privacy Day – Why Don’t I Feel Safer?

Happy Data Privacy Day – Now stop the hysteria.

In honor of Data Privacy Day (Jan 28th), we must point out how the hysteria surrounding privacy has created an irrational fear that slows adoption of important technologies and actually hurts people as a result.

Privacy is a serious matter. We all know someone who has had their identity stolen. The financial loss, inconvenience, and personal violation cause identity theft to rank alongside health issues as one of the worst things that can happen to an individual. Our ever expanding digital footprint creates a target-rich environment for criminals that exposes deeply personal matters of finance and individual privacy.

Sadly, many so-called privacy advocates are exploiting this fear to ensure their own relevance. They are using opportunities like “Data Privacy Day” to convince consumers to avoid “big data” and opt out of many modern conveniences. They juxtapose modern data-hungry digital services with identity theft, leaving the consumer afraid and confused. The advice is to “just say no” to all matters of digital consent, particularly if there is big data connected to a big company. They promote the idea that large corporations are looking to steal our assets and make huge profits from the details of our lives, leaving us exposed and compromised in the process.

Of course there are unscrupulous companies in today’s world that should not be entrusted with your personal information. There are many firms out there that have historically done a poor job of managing their consumer relationships. A company that positions profits before brand and consumer trust is hardly the model we strive for. Instead, as consumers, we should insist that these companies implement the kind of rigor that secures our personal data, maintains our privacy, contributes insight, and provides consumers the means and option to change their minds.
So let’s not throw the baby out with the bathwater.

The advice and fear mongering promulgated by these Luddites provides no advantage in the digital age. Suggesting that we can and should opt out of modern digital services that aggregate data is like asking us to keep our life savings in our mattress. Should we store our flash drives under our mattress or in our freezer for safekeeping, where they will be safe from outside use? Those who do will suffer significant disadvantages compared to those who participate. Like depreciating capital assets, data locked away provides no return; it should be put to better use.

Analytics on Big Data opens doors and offers insights that were never before imagined. Computational analytics on large data sets changes the game in Agriculture, Health, Automotive, Energy, and almost every other sector. Large aggregated data sets allow science to discover the weakest of signals and amplify those signals in ways that produce predictive and informed insights. That is, unless consumers are frightened into thinking the risks outweigh the rewards.

Instead of alarming consumers about the dangers of participating, we should provide consumers with the facts about big data and the role privacy plays. Privacy advocacy groups would better serve their constituents by detailing the questions people should ask and provide specific demands consumers should make of their personal data suitors.

We at Sphere of Influence have developed our Data Privacy Rules of Engagement. We believe these types of rules are a good approach for consumers who want to be sure that a request for their personal data will yield collective results without compromising their identity.

Sphere of Influence Data Privacy Rules of Engagement
1. First Do No Harm
2. Preserve the Public’s Right to Know
3. Preserve Consumer Right to be Forgotten
4. Preserve Consumer Right to be Remembered
5. Keep Relevancy Relevant

These competing rights, and the privacy of the personal components of data, can be accomplished through a robust application of process and technology. This in turn can keep private data private, while still allowing aggregated, anonymized data to benefit consumers and society at large. These techniques combine anonymization, multi-mount-point architecture, split repositories for private and service-accessible data, and a comprehensive service layer that only provides access to the data to which it has rights.


Big data is streaming off our vehicles, portable devices, and consumer electronics, portraying an important and valuable digital imprint of ourselves. There is no putting the genie back in the bottle. Rather than use hysteria to make consumers run for the hills, we should accept the reality of today’s digital world, embrace the opportunity for advancement in science, and insist on a comprehensive approach to data privacy from the companies that use our data.

Chris Burns, Director
Sphere of Influence Software Studios
-A Premium Analytics Company

Too Little, Too Late – Morgan Stanley could have prevented the Data Leak


In a recent article about the Morgan Stanley insider theft case, Gregory Fleming, the president of the wealth management arm, said:

“While the situation is disappointing, it is always difficult to prevent harm caused by those willing to steal”

Disappointing? 350,000 clients were compromised, including the top 10% of its investors, and this follows a breach that left 76 million households exposed.

Morgan Stanley fired one employee.

The fact is, this breach was preventable. Firms like Morgan Stanley are remiss in allowing these breaches to occur, and they add to the problem by perpetuating the myth that breaches cannot be stopped. The minimal approach of repurposing perimeter cyber security solutions does not work. These perimeter solutions and practices were in place in each well-known case of insider breach, including the U.S. government (i.e. Bradley Manning, Edward Snowden), Goldman Sachs, and the multiple Morgan Stanley breaches. Even Sony Entertainment had some intrusion protection in place. Cyber security professionals remain one step behind the criminals in defining events, thresholds, and signatures – none of these are effective against the insider.

Building behavioral profiles for all employees, managers, and executives using objective criteria is the best, and possibly the only, feasible way to catch the insider. Current approaches that focus the search for malicious insiders on the appropriateness of web sites visited, or on an employee’s stability as inferred from marital situations, seem logical but provide little value. Plenty of people get divorced without stealing from their employers or their country.

Rules and thresholds defined by human resource and cybersecurity professionals have proven ineffective at stopping the insider.  Data analytics using unsupervised machine learning on a large, diverse dataset is essential.  Sphere of Influence developed this technology and created the Personam product and company.

Personam catches insiders before damaging exfiltrations.  It is designed for the insider threat, both human and machine based, and has a proven record of identifying illegal, illicit, and inadvertent behaviors that could have led to significant breaches.

The malicious insider can be caught. It is time to take the threat seriously, and time to stop giving firms like Morgan Stanley (and Sony) a pass on their unwillingness to address the fact that they have people on the inside willing to do harm to their clients, their company, and, in some cases, our country.

Dealing with your crappy data

Let’s confront a real problem in Big Data. Inside your data warehouse lurk errors that could potentially render your data as useless as a thousand disks packed with random numbers. With all the hype in the industry around storage, transfer, data access, point-and-click analysis software, etc., more emphasis should be placed on detecting, evaluating, and mitigating data errors.

If you are thinking, “my data contains no errors”, you are living in denial. Unless you are storing output produced by a closed-form mathematical expression or a random number generator, you should recognize that any sensor or data collector — whether an industrial process temperature sensor, a bar code reader, a social media sentiment analyzer, or a human recording a customer transaction — generates some error along with the true value.

In its basic form, the data residing in your data warehouse, the “measured value”, is a combination of the true value plus some unknown error component:


Measured Value = True Value + Unknown Error


This expression is valid for continuous variables, where differences can be quantified (like position, account balance, or number of miles between customer and retail store). It is also valid for categorical variables: discrete values that may be ordered (like level of income or age category) or unordered (like product category or name of sales associate).

Now, if we knew the error for each measured value, we could simply subtract it and restore the true value. But, unfortunately, that is usually not possible (otherwise, we would just store the true value right away), so we have to characterize the error as an uncertainty:


True Value = Measured Value ± Uncertainty


Clearly, if the error is significant, it can hide the true value, resulting in useless and potentially misleading, deeply flawed analytics. One of the fundamental responsibilities of a data scientist is to characterize the uncertainty in the data due to the error, and know (or estimate) when the error could contribute to faulty analysis, incorrect conclusions, and ultimately bad decisions.

Leeds General Infirmary in England shut down its children’s heart surgery ward for eleven days in March 2013 after errors and omissions in patient data led the hospital’s directors to incorrectly conclude that the child mortality rate was almost twice the national average. We assess that had these data errors been properly detected and mitigated, the analysis would have been accurate, the ward would not have been shut down, and more patients could have been treated. Source: http://www.bbc.com/news/health-22076206


So, what are the types of errors in my data?

Data errors occur in all shapes and sizes, and uncertainty analysis should consider both the type of error and its relative magnitude. A multitude of error-mitigation methods are available, but each is effective only for specific types of error, and careless application can even amplify the errors. Let’s decompose the unknown error into random errors and systematic errors, because the ways of handling them are very different:

Measured Value = True Value + Random Error + Systematic Error
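The decomposition is easy to see in simulation. The sensor model and magnitudes below are invented purely for illustration:

```python
import random

random.seed(42)
true_values = [20.0 + 0.1 * t for t in range(100)]  # a slowly rising temperature

bias = 1.5  # systematic error: a fixed offset, e.g. an uncalibrated sensor
measured = [v + random.gauss(0, 0.5) + bias for v in true_values]

# The random component averages toward zero over many samples; the bias
# does not, so the mean error exposes the systematic component.
mean_error = sum(m - v for m, v in zip(measured, true_values)) / len(measured)
# mean_error sits near the 1.5 bias, not near zero
```

This asymmetry is exactly why the two error types need different treatment: averaging more samples beats down the noise but leaves the bias untouched.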


Noise — the Random Errors

Suppose a sales associate repeatedly types in customers’ names, phone numbers, and addresses and, every once in a while, makes a typo – a replacement, deletion, or insertion error. Any previous typo has no influence on the next typo; the typos are independent of each other. Or suppose you are recording temperature at various times and places in an industrial plant, where each sensor reports a variation about an average, and the average is an accurate representation of the true temperature for that time and location. In both examples, the errors have no sequential pattern and are entirely random, or noisy.

The nice thing about random errors is that they are readily handled by general-purpose statistical methods that you can find in mainstream statistical software packages. Noise can be estimated and smoothed out with simple averages and standard deviations, or with more advanced filtering or prediction techniques, such as Kalman filters or spectral analysis (for example, methods based on Fourier or wavelet transforms). Other noise sources, such as the random typos above, can be readily fixed using statistical parsers or spelling- and grammar-checking techniques.
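As a minimal illustration of smoothing purely random error with a moving average (the signal and noise levels here are invented; real work would reach for a proper filter from a statistics package):

```python
import random

def moving_average(xs, window=5):
    """Smooth independent, zero-mean noise by averaging neighboring samples."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - half): i + half + 1]  # shorter window at the edges
        out.append(sum(chunk) / len(chunk))
    return out

random.seed(0)
true_signal = [10.0] * 50                        # a constant true value
noisy = [v + random.gauss(0, 1.0) for v in true_signal]
smoothed = moving_average(noisy)

# Averaging a window of independent samples shrinks the noise standard
# deviation by roughly sqrt(window), pulling estimates toward the true value.
worst_noisy = max(abs(x - 10.0) for x in noisy)
worst_smoothed = max(abs(x - 10.0) for x in smoothed)
```

The same trick would do nothing for a systematic offset, which is the subject of the next section.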


Bias — the Systematic Errors

Just like a wristwatch that is running ahead by three minutes, systematic errors maintain consistent patterns from measurement to measurement. An important distinction between random errors and systematic errors is that while random errors can be handled statistically, systematic errors cannot (Taylor, 1997).

Unfortunately, biases – fixed or drifting offset errors – tend to work their way into most measurements from all types of data collectors, from uncalibrated sensors to inadequately sampled survey populations. In fact, the only reliable way to detect or eliminate bias is to compare against some form of truth reference. For example, one of our customers presented us with data sets of physical measurements from heavy land machinery, where each machine reported large quantities of data with different biases. We quantified the bias characteristics by comparing the machinery measurements against a validated baseline for a limited subset of the data. Then, by applying relaxation algorithms, we were able to minimize the bias errors across the machines.
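That calibration step can be sketched with invented numbers: estimate the fixed offset over the baselined subset, then remove it from the stream (a sketch of simple offset calibration, not the relaxation algorithms mentioned above):

```python
# A validated baseline exists for a small subset of one machine's measurements.
machine_readings = [101.8, 99.9, 103.1, 100.4, 102.0]
baseline_truth   = [100.0, 98.2, 101.2,  98.5, 100.3]

# Estimate the fixed offset as the mean difference over the calibrated subset.
diffs = [m - t for m, t in zip(machine_readings, baseline_truth)]
bias_estimate = sum(diffs) / len(diffs)        # about 1.8

# Remove the estimated bias from the machine's data stream.
corrected = [m - bias_estimate for m in machine_readings]
```

Note that this only works because a truth reference exists for part of the data; without one, the offset is invisible to purely statistical summaries.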

Suppose we survey a consumer group about product preferences from urban communities in one large metropolitan area. Even with very large sample sizes, are these results meaningful for rural populations? Are they representative of other cities throughout the US? If we gathered even more responses from the same city, would our data be a better approximation for the US? The answer is most likely no in each case, because the data is biased toward the preferences of the sampled population.

Another representative example of sampling bias that we see regularly, based on the Nyquist criterion: suppose we want to compute the slopes along a particular road from GPS elevation measurements recorded at regular quarter-mile intervals. The problem with this approach is that any slope between two hilltops separated by less than half a mile will be aliased – the slopes will appear much smaller than they are in reality. (This is the same phenomenon that makes wagon wheels appear to rotate backwards in old western movies.)

We regularly see such biases working themselves undetected into analytics that could lead to bad decisions. In our experience, detecting and mitigating bias is much more challenging than dealing with random error, because it requires an intimate knowledge of the domain and because standard statistical methods are not generally applicable.


How to make your crappy data useful

Now that we have described how error can dramatically reduce the utility of your data, what should one do to mitigate its bad influence on analytics?

First, know your data and quantify its uncertainty. Understand the conditions and environment under which the data is collected; then, for a representative part of your data, find a trustworthy baseline to compare against. Use the baseline to “reverse-engineer” the errors, quantifying the random and systematic errors separately and recognizing that the mitigation techniques for each are quite different. Describe the spread of the random error with common measures, such as the standard deviation; describe the systematic fixed or varying offset errors as means and slopes over sequential segments of the data. It is especially important to characterize the uncertainty when merging in new data sources, to ensure that the new data doesn’t amplify the errors significantly.

Second, understand how error can affect your analysis. We frequently use simulation for a sensitivity analysis, starting with an error-free condition and progressively increasing the random and systematic errors until we detect a significant reduction in performance. Suppose we have a model that predicts whether an automobile service customer will return to the service bay, based in part on the distance the customer lives from the service bay. We can then insert different error conditions on the distance variable and empirically determine when the model fails to predict customer behavior reliably.
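That sensitivity analysis can be sketched as follows. The threshold model and all numbers here are invented for illustration, not a real customer-return model:

```python
import random

def toy_model(distance_miles):
    """Stand-in predictor: customers within 10 miles are predicted to return."""
    return distance_miles < 10.0

random.seed(1)
true_distances = [random.uniform(0, 20) for _ in range(1000)]
truth = [toy_model(d) for d in true_distances]   # error-free predictions

# Progressively inject random error into the distance variable and measure
# how often the model's prediction still matches the error-free prediction.
accuracies = []
for noise_sd in [0.0, 1.0, 2.0, 4.0, 8.0]:
    noisy = [d + random.gauss(0, noise_sd) for d in true_distances]
    preds = [toy_model(d) for d in noisy]
    accuracies.append(sum(p == t for p, t in zip(preds, truth)) / len(truth))
# accuracies starts at 1.0 and tends to fall as the injected error grows,
# revealing how much measurement error the model can tolerate.
```

The same loop can be repeated with a fixed offset instead of gaussian noise to probe sensitivity to systematic error.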

Third, apply error mitigation as a preprocessing stage. In our experience, many analytic tools, such as classifiers, perform better when the random error is smoothed out. Unmitigated biases can propagate inconsistent data features into downstream analytics, so it is useful to first determine the regions of the data that are potentially affected by bias. If the high-bias regions can be identified, they can be excluded when it is not possible to mitigate the bias error. Detection and mitigation of bias is specifically tailored to the type of data and the method of collection.

So, how crappy is your data?

Do you know if the errors are affecting your results and providing potentially flawed “insights”? Are you tracking the noise or the signal? Is your data so corrupted by error that any advanced analytics lead to contradictory conclusions? If so, you may need to refocus your corporate data strategy on more rigorous error characterization and mitigation techniques.

If computers can beat Jeopardy! champions, why can’t they detect the insider threat?

The world was awed two years ago when IBM’s Watson defeated Jeopardy! champions Brad Rutter and Ken Jennings. Watson’s brilliant victory reintroduced the potential of machine learning to the public. Ideas flowed, and now this technology is being applied practically in the fields of healthcare, finance and education. Emulating human learning, Watson’s success lies in its ability to formulate hypotheses using models built from training questions and texts.


Three years ago, Army Private First Class Bradley Manning leaked massive amounts of classified information to WikiLeaks and brought to public awareness the significance of data breaches. In response to this and several other highly publicized data breaches, government committees and task forces established recommendations and policies, and invested heavily in cyber technologies to prevent such an event from reoccurring. Surely, we thought, if anyone had the motivation and resources to get a handle on the insider threat problem, it is the government. But, Edward Snowden, who caused the recent NSA breach, has made it painfully obvious how impotent the response was.


Lest we assume this is just a government problem, abundant evidence shows how vulnerable commercial industry is to the insider. We are inundated with a flood of articles describing how malicious insiders have cost private enterprise billions of dollars in lost revenue, so why has no one offered a plausible solution?


The insider threat remains an unmitigated problem for most organizations, not because the technologies do not exist, but rather because the cyber defense industry is still attempting to discover the threat using a rules-based paradigm. Virtually all cyber defense solutions in the market today apply explicit rules, whether they are antivirus programs, firewalls with access control lists, deep packet inspectors, or protocol analyzers. This paradigm is very effective in defending against known malware and network exploits, but fails utterly when confronted with new attacks (i.e. “zero-days”) or the surreptitious insider.


In contrast, acknowledging that it was impossible to build a winning system that relied on enumerating all possible questions, IBM designed Watson to generalize and learn patterns from previous questions and use these models to hypothesize answers to novel questions. The hypothesis with the highest confidence was selected as the answer.


Like Watson, an effective technology for detecting the insider must adaptively learn historical network patterns and then use those patterns to automatically discover anomalous activity. Such anomalous traffic is symptomatic of unauthorized data collection and exfiltration.


Inspired by the WikiLeaks incident, Sphere’s R&D team has investigated machine learning algorithms that construct historical models by grouping users by their network fingerprints. As an example, without any rules or specifications, the algorithms learn that bookkeeping applications transmit a distinctive pattern that enables grouping accountants together, and HR professionals are grouped by the recruiting sites they visit. These behavioral models generalize normal activity and can be used as templates to detect outliers. While users commonly generate some outliers, suspicious users deviate significantly from their cohorts, such as the network administrator who accesses the HR department’s personnel records. Like Watson, the models allow the system to form hypotheses.
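A toy sketch of this kind of cohort-based outlier detection, assuming each user has already been reduced to a numeric network-fingerprint vector; the users, features, and the 1.5-standard-deviation cutoff are hypothetical, not Sphere's actual algorithm:

```python
import math

def cohort_outliers(profiles, k=1.5):
    """profiles: {user: [feature, ...]} for one cohort (hypothetical data).
    Flag users whose distance from the cohort's mean profile exceeds
    k standard deviations of all such distances."""
    dims = len(next(iter(profiles.values())))
    # Mean profile of the cohort, feature by feature.
    centroid = [sum(p[d] for p in profiles.values()) / len(profiles)
                for d in range(dims)]
    dist = {u: math.dist(p, centroid) for u, p in profiles.items()}
    mean = sum(dist.values()) / len(dist)
    sd = (sum((d - mean) ** 2 for d in dist.values()) / len(dist)) ** 0.5
    return [u for u, d in dist.items() if sd > 0 and d > mean + k * sd]
```

With three accountants sharing similar fingerprints and one administrator whose traffic looks nothing like theirs, only the administrator is flagged.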


Applied to cyber security, every time an entity accesses the network, the algorithms hypothesize whether the activity conforms to its model. If it does not conform, that activity is labeled an outlier. Because these methods use a statistical confidence that dynamically balances internal thresholds on network activities (e.g., sources and destinations, direction and amount of data transferred, times, protocols, etc.), they are extremely hard for a malicious insider to outsmart. The mere fact that the system does not reveal its thresholds can have a significant deterrent effect.


A paradigm shift in cyber technologies is happening now. Cyber security professionals agree that preventing data breaches from a malicious insider is a difficult task, and the past suggests that the next major breach will not be detected with existing rules-driven cyber defense solutions. Next generation cyber security technology developers must seek inspiration from IBM’s Watson and other successful implementations of machine learning before we can hope to prevail against the insider threat.



Why the Government Insider Threat Program Will Fail

President Obama has ordered federal employees to monitor the behavioral patterns of coworkers and report suspicious observations to officials.  Under this policy a coworker’s failure to cast suspicion on another coworker could result in penalties that include criminal charges.

Seriously! This is the current policy for preventing the next insider threat: pit coworker against coworker!

Well…interestingly enough, they are half-right. Behavior profiling is the only way to identify an insider threat. Typically these “threats” are clever people who conceal a hidden agenda, often in plain sight. If a trusted insider is careful, as both Bradley Manning and Edward Snowden were, then we shouldn’t expect to catch them in the act of stealing, spying, or exfiltrating. They will do their jobs normally, act normally, and do nothing careless that would arouse suspicion. Of course, that’s just on the surface. There will always be little behaviors these people can’t control that are different from “normal” because, let’s face it, they are different from normal coworkers. Insider threats have a secret agenda and the burden of carrying whatever motivates them to embrace that agenda. They might be good fakers, but at some level they are different, and those differences will manifest in behavior – maybe not in big things, but in little things they do every day. If it were possible to monitor their behaviors with a sensitive enough instrument, then, theoretically, we could detect the fact that they are different and isolate “suspicious” differences from “normal” differences.

Of course, the experts in the field (behavioral psychologists and security researchers) have no idea what constitutes suspicious behavior. Heck, in any given group of workers we don’t even know what we should consider normal, let alone suspicious. If you typically print 20 pages per week and suddenly have a week where you print 100 pages, is that evidence that you are the insider threat, or were you just assigned a big presentation where lots of copies are needed? If someone sends you a link to a file or website that is unrelated to your normal work, is opening that link or downloading that file evidence you are a threat? Perhaps, but probably 99.999% of the time the answer is no.

Fooled by Randomness

The problem with the Government’s Insider Threat Program is that it asks squishy human beings to be the sensor, the profiler, and the alarm. Suddenly coworkers are jotting down notes when a cubemate takes an unusual number of bathroom breaks. Is she the next Edward Snowden or is she pregnant? It’s left to an individual’s imagination to decide what is “normal” vs. “abnormal”. Naturally, people will inject identity and cultural bias, they will show favor to coworkers they like and disfavor to those they dislike, office politics will weigh in, and people will err when attempting to read suspicion into normal events. Nassim Nicholas Taleb’s great book, “Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets”, illustrates so clearly how human beings are easily fooled into seeing causality where there is only correlation, or into misreading the presence of correlation. Behaviors one might think are clear indicators of suspicion, like printing 100 pages when 20 is the norm, are just part of the everyday “jitter” in the normal behavior of individuals, departments, and organizations.


False Positives


Behavior profiling results in a tremendous number of false positives, i.e., false accusations. The experts don’t know what behaviors to monitor, there is no proper baseline for “normal”, and there is no objective way of discerning whether a novel behavior is threatening or benign. Moreover, because differences and novelty stand out simply by being different, humans are biased toward labeling them as suspicious. Imagine an already clogged government bureaucracy further impeded by a flood of false accusations, each requiring some non-trivial investigation in order to clear the names of good people. Also imagine that the actual insider threat, the next Bradley Manning or Edward Snowden, doesn’t behave in any of the “obvious” ways that would trigger coworker suspicion; their behavioral modalities are subtle and go unnoticed, allowing them to continue inflicting damage.

Analytics to the Rescue

The worst thing we could do, far worse than doing nothing, would be what the administration’s policy requires: using coworkers to monitor each other and report suspicious behavior in a context where underreporting is punishable under the criminal code. There’s absolutely no way that ends well.

If the goal is to solve the problem and mitigate the insider threat, then we need to take the human out of the loop. The correct approach is to use digital sensors to collect a wide array of features that are representative of daily activity in a workforce, and then feed those collection streams into an Analytics Process that objectively profiles behavior, separating normal from unusual and classifying it as non-threatening vs. threatening. This is the best way to identify an insider threat, but it is not without its own set of problems.

First, there is the problem of determining “normal” behavior and separating it from “abnormal/anomalous” behavior. On the surface this appears easily done with simple statistical methods, but on deeper reflection it gets much more complicated when dealing with human behavior. Even when a person does the same job every day, they do it differently each time. There is variation, i.e., “jitter”, in almost every aspect of both human and organizational behavior. For example, is suddenly printing 100 pages when 20 is the norm something we should consider an outlier, or is it ok despite being statistically unusual? We end up needing a technology that can discriminate between “normal anomalies” and “abnormal anomalies”; which, despite the grammatical contradiction, is exactly the challenge.
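One common way to keep everyday jitter from polluting the baseline is a robust score built on the median and the median absolute deviation (MAD) rather than the mean, so one wild week doesn't distort what counts as "normal". A minimal sketch, using hypothetical weekly page counts and the conventional 3.5 modified-z cutoff; note this only finds statistical outliers, and deciding whether an outlier is actually threatening remains the harder problem:

```python
import statistics

def robust_outliers(history, threshold=3.5):
    """Flag values whose modified z-score (median/MAD-based) exceeds the
    threshold. Robust to ordinary jitter in the rest of the history."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return []  # no spread at all; nothing can be scored
    # 0.6745 scales MAD to be comparable to a standard deviation.
    return [x for x in history if 0.6745 * abs(x - med) / mad > threshold]
```

Fed weekly counts like 17–25 pages plus one 100-page week, it flags only the 100-page week, while ordinary swings between 17 and 25 score well under the cutoff.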

Second, there is the problem of false positives. Because bona fide threats are so rare compared to the number of everyday things people do that are different or unusual, false positives are inevitable. The social system breaks down if we are constantly casting a shadow of suspicion on good workers expressing normal everyday behaviors. This is what has prevented behavior profiling technology from succeeding in the past. Although researchers have invented various ways to crack the normal-vs.-anomaly problem, the technologies still produce a flood of false positives that makes them impractical and unusable.
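The false-positive problem is at root a base-rate problem, and a one-line Bayes calculation shows how severe it is. With illustrative numbers (not from the post): if one worker in 10,000 is a genuine threat and a detector catches 99% of threats while flagging only 1% of innocent workers, fewer than 1% of its alerts point at a real threat:

```python
def alert_precision(prevalence, tpr, fpr):
    """P(threat | alert) by Bayes' rule: true alerts over all alerts."""
    true_alerts = prevalence * tpr
    false_alerts = (1 - prevalence) * fpr
    return true_alerts / (true_alerts + false_alerts)

# Hypothetical: 1-in-10,000 prevalence, 99% detection, 1% false-alarm rate.
p = alert_precision(1 / 10_000, 0.99, 0.01)
```

Here `p` comes out just under 0.01, meaning roughly 99 out of 100 alerts would be false accusations even with an unrealistically good detector.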

Third, there is the problem of scoring the threat itself. Even when a behavior has been correctly profiled as “truly unusual”, it might still be ok. Radically unusual behavior is often a good thing, especially if we want a workforce to innovate, adapt, and progress; otherwise we might as well use robots. It’s no easy task to profile a behavior and determine its potential as a threat. Expert psychologists and security researchers have never found reliable predictive patterns; there is no “standard model” for a bad actor in a high-trust environment.

Insider Threat Detector (Shameless Plug Time)

At Sphere of Influence we have developed, and are currently field testing, the only technology in the world that actually does this in a practical, sustainable way, at scale, and in real time. We use a common type of cyber-security appliance as a sensor to collect features that are representative of a workforce’s daily activities. That sensor drives real-time streams into a unique Analytics Processor that incorporates advanced profiling and unsupervised machine learning to create behavioral profiles and to identify human (and non-human) actors with truly anomalous behavior. One of our most important secret ingredients is our approach to radically reducing false positives, which makes this type of technology practical for the first time. Another is our solution to the problem of profiling behaviors in real time, at scale, on incredibly large data streams. The final processing stage analyzes the profiling output with a supervised machine learning layer that scores the threat.

Our technology has thus far proven effective at finding insider threats, simulated with AI bots, early in their activity cycle and before they would defect or go public. Unlike previous experiments and prototypes that have been developed by others in this area, ours is a practical and fieldable technology that effectively detects insider threats without clogging the bureaucracy with false positives.


Barriers to Adoption

Anytime an employer considers deploying a technology that collects on the behaviors of its workforce there will be concerns about ethics, privacy, and civil liberties. People don’t like being monitored while they work, particularly if they think a subconscious tick might expose something private or interfere with reputation and career advancement. These are valid concerns that cannot be easily dismissed. Some workforces will be more sensitive than others. For example, people who work in classified environments already expect to be monitored and agree to random search every time they enter a facility; the same isn’t necessarily true for people who work at an insurance company, hospital, brewery, or bank.

We don’t open people’s mail!

Sphere of Influence developed Insider Threat Detector technology with these concerns in mind. The cyber-based sensor component doesn’t invasively snoop into what people are doing on the computer or the network. We don’t analyze payload data and the contents of private communications remain private.  Our technology doesn’t provide security staff any access to private communications or the ability to eavesdrop.  In fact, it works just as well on encrypted data streams as it does on unencrypted streams.  Our design goal was to be no more intrusive than technologies that are already common in large enterprises.

That leaves false positives. Our technology reduces false positives from a flood to a very manageable trickle, but there will always be some because the science is based on math, not magic. This technology is intended to provide early warning and alerting, not to accuse or indict. We don’t label people as threats; we identify suspicious behavior and score it on a threat scale – where the scale is adjusted so even the highest score is still low. Investigation and forensics are required before anyone can be considered a threat. That said, even this will worry some – hence, there will be barriers to universal adoption of this technology. Regardless of those barriers, however, this is far less intrusive than pitting coworker against coworker under a cloud of universal suspicion, as the current policy does.

Shout Out for a Demo

If you would like a private demonstration of this technology please contact us. We love showing off!


Wake up! It’s the insider threat you need to worry about



Edward Snowden is the new face of the insider threat; the media even calls him the “Ultimate Insider Threat”. This is someone who has the highest-level security clearance, endures a background reinvestigation every five years, takes a polygraph exam, and still betrays his sacred oath and the trust of his employers.
When it comes to asserting workforce trustworthiness, industry and government are both guilty of over-relying on employment pre-screening, background investigations, and oaths.  These are effective to a degree and good first steps but obviously inadequate when it comes to preventing losses and breaches.

Insider threats are detectable because they don’t behave exactly like everyone else.  Maybe on the surface these people appear to be the same as their coworkers, but at some level their behaviors are different.  A sensitive enough instrument can detect such subtle differences in behavior, and if the noise of anomalies can be removed then high-quality actionable alerts can be generated from the “unusual anomalies”.  This is the basis of the Insider Threat Detection technology that has been developed by Sphere of Influence over the past two years.

The problem isn’t cyber-security, which is focused on the threat of digital attacks against digital assets. This is an industrial security threat, in which a person of trust betrays that trust and misuses access to cause deep harm or advance a third-party agenda. Unlike cyber-attackers, an effective insider might not even use your digital assets as the vehicle for attack or exfiltration; they might steal files from a safe or do other things. However, if a person’s normal behavioral modalities change even slightly, shadows of those changes are often reflected in how they use the computer; thus computer activity can yield a behavioral profile for an individual, even if the actual threatening behavior is more analog than digital.

By connecting a sensitive behavioral profiling instrument to a network we can construct individual profiles that are accurate enough to perform this type of anomaly detection. Such algorithm-synthesized profiles apply to human and non-human users of a network, giving some cyber-security crossover to this approach in addition to the industrial security focus. However, Insider Threat Detection is not cyber-security, it is industrial security that uses cyber-technology as a sensor.

In our case the goal of this technology is to detect the active insider threat early in the activity cycle. We believe strongly that there is no way to fully prevent insider threats from occurring because no background screening process on Earth will ever accomplish that. To defend against the insider we believe early detection of active threat behaviors is the key to loss prevention.
This is possible thanks to Advanced Data Analytics (Analytics 2.0) techniques which evaluate dozens (or even thousands) of simultaneous feature dimensions on Big Data under a powerful layer of unsupervised machine learning. What makes insider threat detection different from conventional Analytics 2.0 is that it must work on streaming data, in real-time, and at-scale.
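As a sketch of the streaming requirement, Welford's online algorithm maintains a running mean and variance per profile in constant memory, so each event can be scored the moment it arrives rather than by re-scanning history. A real system would track many feature dimensions at once; this single-feature version, with hypothetical names, just shows the shape:

```python
import math

class StreamingProfile:
    """Online mean/variance via Welford's algorithm: one profile per
    entity, updated per event, no stored history (illustrative sketch)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Incrementally fold one observation into the running statistics.
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def zscore(self, x):
        # How many standard deviations x sits from this entity's norm.
        if self.n < 2:
            return 0.0
        sd = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if sd == 0 else abs(x - self.mean) / sd
```

After streaming in, say, daily transfer volumes of 1 through 10, a sudden value of 50 scores more than ten standard deviations out, which is the kind of event that would be escalated for review.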

At Sphere of Influence, because we have been so invested in Advanced Data Analytics these past few years, we were able to solve these problems and invent an instrument that does what I describe here. We use it every day on our networks and it is already installed at beta customers, primarily law offices.

The bottom line is that even the most intense background checks are not good enough; you need to be able to detect insider threats when they become active, and before those threats move to Hong Kong.


Tech that finds bad guys (and girls too)

A hotel worker in China entered Frank’s room while he was away at dinner and installed a new type of spyware on his laptop.  The spyware traveled home with Frank, waiting to connect to the corporate network. Once behind the firewall the spyware infected hosts, generated link charts of business relationships, harvested intellectual property, and collected information on employees and customers.  Occasionally it phoned home, passing data in small chunks that ultimately constitute a treasure trove of secrets to Chinese intelligence.  This went on for months without detection because it used very little bandwidth and communicated through drop points that were legitimate looking URLs in the United States. Anti-virus vendors had never seen this custom-made spyware before so they had no catalog of its signature.
Meanwhile, Cindy has worked for the company for three years, but lately her political views have shifted toward the radical. She is loyal to an organization that operates a fringe website dedicated to spreading propaganda about the type of business the company does. Cindy doesn’t talk politics at work; she keeps her opinions to herself and doesn’t work in critical areas. Cindy’s duties are in mid-level administration and her user accounts grant only limited access to servers, network resources, corporate documents, and production equipment. Despite proper restrictions, Cindy has regular access to a lot of data as part of her job, and because other employees are sloppy about network file sharing she might be able to find things she isn’t authorized to access. When Cindy stumbles on something interesting she copies it to a thumb drive. She doesn’t steal a lot in terms of megabytes and she doesn’t spend much time doing it. Cindy is careful: 98% of the time she’s performing her normal work duties; it’s the other 2% of her computer use that is about to cost the company millions.
In a company with thousands of retail POS terminals the management has no idea of an ongoing attack against their customers. Recently a new type of custom malware has been circulating that infects these POS terminals.  After infecting a terminal the malware skims credit card numbers and customer identity, phoning home through a sophisticated distributed botnet.  POS terminals are built on aging technology that is almost never updated with security patches and the vendor can’t even tell whether a terminal has been infected let alone do anything about it. How does management even know it has a problem?
Like most insider threat scenarios these have one thing in common, they are difficult to detect while they are happening.
A lot of people don’t realize we do hard-science R&D at Sphere of Influence. Almost our entire R&D budget is spent developing profiling technology (not the corrupt southern cop kind of profiling but the good kind). We build algorithms that detect and profile patterns of behavior we call “patterns of life”. With this technology we can reliably detect anomalies in data that is noisy and full of “normal anomalies”. Fraud detection and insider threat detection are probably the top two applications for this. Our newest technologies have unique advantages, such as the ability to detect zero-day attacks and spot malicious activity that hides in plain sight, all in real time.
We have cyber-defense algorithms running today that easily spot the Cindy scenario, which is actually the Bradley Manning scenario from WikiLeaks. These same algorithms also detect the Frank scenario and the POS scenario with ease.
Profiling is a specialty within Data Analytics that’s basically about transforming large uninteresting and mostly indistinguishable data into high-value “patterns of life”.  Combined with unsupervised machine learning we can do some pretty amazing stuff.
I wanted to blog this because it’s cool.  We recently challenged our science guys to solve the insider threat problem and they made spectacular progress!
In particular I think we nail these two scenarios.  Unfortunately, we don’t manufacture appliances, so getting our technology on a network near you is the problem.

Containing the Insider Threat