Behavior Identification Mathematical Model

Identifying people's behavior for software analysis

  ·   12 min read

Introduction

The Oxford English Dictionary defines identity as “The fact of being who or what a person or thing is” and “The characteristics determining who or what a person or thing is”1, suggesting that identity is composed of a set of attributes combined in practically infinite ways, making each person who has lived, is living, or will live unique.

The internet has been dealing with the identity problem in an old-fashioned way: through silos of shattered identities. The Sovrin model of digital identity, by contrast, stands as a decentralized, transparent, secure, and reliable way to identify people on the internet, paying close attention to security and to the nondisclosure of sensitive information in order to avoid correlation. To achieve this, Sovrin proposes a three-dimensional identity space comprising Relationships, Agents, and Attributes.

But identity is a more complex concept that also encompasses habits and behaviors. This work aims to find ways to protect entities’ devices based on learned behaviors expressed as a mathematical model, avoiding dependency on massive amounts of collected user data.

Background

The evolution of the World Wide Web in recent years has taken humans to a new level of abstraction: our presence is online. Social networks provide rich functionality that lets us shape our digital persona as we wish to appear to others. Rowe2 raises the following question: “is the digital identity which a user constructs on a Social Web platform representative of their real-world identity?”. In answering it, they empirically demonstrate a significant overlap between both kinds of identity, the real-world one and the Social Web one. Their experiments show that it is also possible to identify a person by the relationships they keep with others, distinguishing two types: strong-tied (i.e., relationships driven by frequent interactions) and weak-tied (i.e., infrequent interactions). Their contribution is a pair of metrics for measuring the similarity between social networks (i.e., online and offline ones): relevance measures the proportion of the digital social network accounted for by strong-tied relationships, whereas coverage measures the extent to which users’ real-world social networks are replicated within online networks2.

Since the psychological context presented above shows how humans tend to project themselves into digital identities, the question now is: how can digital identities be formalized? S. Wilson et al.3 propose an OSI-like stacked model to help with IoT (i.e., Internet of Things) privacy. From top to bottom, the model includes:

  • Relationships, interactions between users and service providers;
  • Identities, typically indexed through codes, relating relationships to other entities;
  • Attributes, miscellaneous pieces of information that entities such as providers need to know about the people they deal with, used to identify a person (e.g., Know Your Customer practices);
  • Authentication Data, codified attributes exchanged among sub-systems using protocols;
  • Authentication Metadata, ‘data inferred from actual data’, used to confirm the provenance of the original data; and
  • Deeper Network Layers, which transport data, metadata, and authentication information.

To achieve such a stacked representation of data, attention is paid to collection limitation (i.e., limits on the collection of personal data), purpose specification (i.e., specifying what the data is collected for), use limitation (i.e., collected data is not used or disclosed beyond those purposes), and openness (i.e., a general policy of openness about data treatment).

Users typically fill in forms to access online services, providing sensitive personal information. The risk of disclosing such information, if not properly protected, is that it might be misused; the goal is to protect it in order to prevent identity theft and fraud. Bertino et al.4 have proposed a “privacy-preserving multi-factor identity attribute verification protocol”. This protocol uses aggregate zero-knowledge proofs of knowledge (AgZKPK) to let users prove their identities without revealing the actual data. Their approach, built on efficient cryptographic protocols, lets them verify and identify digital identities on cloud platforms.
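As a rough intuition for how one can prove knowledge of a secret without revealing it (a toy Schnorr-style proof sketch, not Bertino et al.’s actual AgZKPK construction; all parameters below are illustrative assumptions):

```python
import hashlib
import secrets

# Toy public parameters (far too weak for real use; demonstration only).
p = 2**127 - 1                    # a Mersenne prime
g = 3                             # public generator
q = p - 1                         # group order

x = secrets.randbelow(q)          # prover's secret (e.g., an identity attribute)
y = pow(g, x, p)                  # public value tied to the secret

# Commitment: random nonce r, sent as t = g^r mod p.
r = secrets.randbelow(q)
t = pow(g, r, p)

# Fiat-Shamir challenge derived from the commitment (non-interactive).
c = int.from_bytes(hashlib.sha256(str(t).encode()).digest(), "big") % q

# Response: the pair (t, s) is the proof; x itself is never sent.
s = (r + c * x) % q

# Verifier checks g^s == t * y^c (mod p) without learning x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
```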

One of the most used frameworks for creating and managing digital identities is Identity and Access Management (IAM), comprised of four components: authentication, authorization, user management, and user directories. There are two types of infrastructure: centralized, tied to organizations, where identifiers and attributes are created and that information is only valid within the organization’s context; and user-centric, which decentralizes identity and delegates control over identifiers and personal information to the user5. The latter is achieved using DLT (Distributed Ledger Technology). However, it faces three big challenges: identity fraud, data breaches, and lack of re-usability of identities. The typical discourse holds that DLT provides remarkable benefits in transparency, immutability, and decentralization. Nonetheless, each of them has issues: transparency collides with nations’ privacy legislation; immutability challenges the very nature of people’s changing minds; and decentralization doesn’t consider other aspects such as jurisdiction or business agreements (Dunphy et al.5). Other related problems have to do with users being responsible for keys from PKIs (Public Key Infrastructure), with delegation of users’ identities, and with the overall user experience of handling such assets.

Most online interactions happen on the basis of judgments about people’s characteristics. Formal systems collect information about specific identifiers in order to come up with reputation scores; this has been done using artificial intelligence and other techniques present in P2P networks6. However, Windley et al.6 show that reputation systems built upon aggregation work better than game-theoretic or stochastic methods. There are several reasons to justify such a design decision, but simplicity is by far the most important: calculating reputation on the fly reduces storage and also enables users to carry their identifiers between different contexts.

Since its inception, the internet was built for machines, and identifying machines is something the stacked OSI model achieves perfectly. However, humans were never intended to interact on the internet the way machines do; human identity is thus the missing layer of the internet. This gap has forced organizations to build internal identity databases over which users have no control; even worse, the digital identity of users is scattered across different organizations all around the world7.

The internet has to evolve towards digital identity, and to do so it must guarantee three key elements: security (i.e., protection from unintentional disclosure), control (i.e., the identity owner must be in total control of their identity), and portability (i.e., users can carry their identities among several contexts)7.

Allen8 outlines the evolutionary path of digital identity:

  • Centralized, identities are totally owned by a single organization.
  • Federated, where the user has the chance to use their credentials in other services.
  • User-Centric, where any flow of information from claimers to relying parties only happens if the user has requested it.
  • Self-Sovereign, the highest level, where the three key elements mentioned above are present: the system is decentralized, the user fully controls their identity (i.e., the user becomes their own identity provider), and no organization holds rights over the user’s identity.

In order to compete with physical documents (credentials, from now on) issued by Issuers, a system must guarantee, as stated by Khovratovich9:

  • Compatibility among several organizations,
  • Unforgeability to avoid false positives,
  • Scalability in order to support hundreds of interactions simultaneously,
  • Performance/Low latency to quickly resolve requests,
  • Revocation so an Issuer can let Users and Verifiers know when a credential was revoked.

But there is another group of four characteristics that are especially appealing:

  • Minimal dependency, so “an Issuer should not be involved during the preparation and presentation of proofs”.
  • Privacy/Anonymity. There is no need to disclose the actual user’s identity to Verifiers that do not need such information.
  • Unlinkability. Credentials cannot be linked when presented in different places.
  • Selective disclosure. The user controls what they keep private and what goes public.

These are the basis of the Sovrin Identity Network (SIDN). It “consists of multiple, distributed nodes located around the world. Each one has a copy of the ledger. Nodes are hosted and administered by stewards. Stewards are responsible for validating identity transactions to assure consistency about what is written on the ledger and in what order. They do this using a combination of cryptography and an advanced Byzantine fault tolerance algorithm”6.

Sovrin tracks keys and identifiers, avoiding correlation by allowing the user to use a different identifier with everybody they relate to. Claims are assertions made by a user or party about themselves or another; they are digitally signed, so anyone receiving a claim can be sure who issued it. Disclosures, or “disclosure proofs allow claims to be used without disclosing unnecessary information about the subject”6.

Paul A. Grassi et al.10 also present the general Digital Identity Model: an applicant applies to a CSP (i.e., Credential Service Provider) through an enrollment process; the CSP then performs identity proofing on the applicant and, if it succeeds, the applicant becomes a subscriber; “authenticators and corresponding credentials are established between the CSP and the subscriber”10; the CSP maintains the credential, alongside its status and the enrollment data collected for it, while the subscriber keeps their authenticators.

As for clustering techniques, fuzzy and possibilistic c-means algorithms are widely used for soft clustering, where each object can be assigned to several clusters, and have been applied successfully to nonlinear system identification. Adding weights to these techniques minimizes the negative effects of noisy objects, which has been shown to increase their performance, effectiveness, and scalability11.
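As a concrete sketch of the weighted soft-clustering idea, here is a minimal fuzzy c-means variant in NumPy with per-object weights to damp noisy readings (the weighting scheme is an assumption for illustration, not the exact algorithm of Yang et al.11):

```python
import numpy as np

def weighted_fcm(X, w, c=3, m=2.0, iters=100, tol=1e-5, seed=0):
    """Fuzzy c-means with per-object weights w (noisy objects get low weight)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # random fuzzy memberships
    for _ in range(iters):
        Um = (U ** m) * w[:, None]             # weighted, fuzzified memberships
        centers = Um.T @ X / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)))     # standard FCM membership update
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Example: two soft clusters over 2-D sensor features, outliers down-weighted.
# X = np.vstack([...]); w = np.ones(len(X)); centers, U = weighted_fcm(X, w, c=2)
```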

Krechowicz et al.12 introduce an ‘automatic hierarchical clustering mechanism’ for a scalable distributed two-layer datastore, improving features such as data indexing, querying on specified subspaces, and automatic outlier detection. The technique splits information into two layers: the first manages the data and the second actually stores it.

Wang et al.13 analyze different clustering techniques and also propose a split-merge evolving k-clustering method that is as flexible and simple as k-means and agglomerative hierarchical algorithms: the data set is randomly split into clusters, the process is repeated through several iterations, bad clusters are discarded, and the surviving ones are finally merged into high-quality clusters. A rough sketch of this split-discard-merge loop appears below.
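This sketch is a loose paraphrase of the idea, not Wang et al.’s exact method13: run k-means from several random initializations, keep only the most cohesive clusters, then greedily merge centers that land close together (all thresholds here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_merge(X, k=4, rounds=5, keep_ratio=0.75, merge_dist=0.5):
    """Toy split-merge clustering: keep cohesive clusters, merge nearby centers."""
    kept = []
    for seed in range(rounds):                          # repeated random splits
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        for j in range(k):
            pts = X[km.labels_ == j]
            if len(pts) == 0:
                continue
            # Mean distance to the center: lower means a more cohesive cluster.
            cohesion = np.linalg.norm(pts - km.cluster_centers_[j], axis=1).mean()
            kept.append((cohesion, km.cluster_centers_[j]))
    kept.sort(key=lambda t: t[0])                       # most cohesive first
    kept = [c for _, c in kept[: int(len(kept) * keep_ratio)]]  # discard the rest
    merged = []
    for c in kept:                                      # greedy merge of close centers
        for i, m_c in enumerate(merged):
            if np.linalg.norm(c - m_c) < merge_dist:
                merged[i] = (m_c + c) / 2
                break
        else:
            merged.append(c)
    return np.array(merged)
```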

Problem definition

The Sovrin identification system uses an R3 space whose axes represent Attributes, Relationships, and Agents (i.e., devices). This is a fine representation of digital identity, but a better one might include further factors, such as behavior. Adding a fourth orthogonal axis (yielding an R4 hyperspace) to the identity representation might improve its security features.

A good source of data for behavior analysis is mobile devices. Through their sensors, lots of data can be gathered to perform the desired analysis (e.g., accelerometer, gyroscope, magnetometer, GPS, proximity sensor, ambient light sensor, microphone, touch screen sensor, fingerprint sensor, barometer, thermometer, air humidity sensor, and even a Geiger counter). Based on such data, a mathematical model can be devised by means of clustering techniques, for instance by first windowing the raw streams into feature vectors.
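A minimal sketch of that preprocessing step (the window size and feature choice are assumptions):

```python
import numpy as np

def windows_to_features(samples, win=50):
    """Turn a stream of (x, y, z) accelerometer samples into feature vectors.

    Each window of `win` samples becomes one object for clustering:
    per-axis mean and standard deviation (6 features per window).
    """
    samples = np.asarray(samples, dtype=float)
    n = (len(samples) // win) * win          # drop the trailing partial window
    w = samples[:n].reshape(-1, win, 3)
    return np.hstack([w.mean(axis=1), w.std(axis=1)])
```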

Current incremental learning systems fit models through slight modifications driven by new data, rather than creating a whole new model from historical plus new data; this makes incremental learning suitable for acquiring new knowledge. However, it still relies on having access to both historical and new data14. A system that incrementally learns about users therefore poses a great concern about holding sensitive data and privacy.

As humans, we recognize people through a match between a structural encoding (a codification of aspects of face structure) and, principally, an encoded representation (abstractions that serve as the main identification tool), together with a set of identity-specific semantics such as names and ‘person information’15. To recall such representations, there is no need to bring back all the records that led to them.

Inspired by the latter idea, it might be possible to let devices create the aforementioned model from a set of initial data and then dispose of that data. Once new data becomes available, it will be incorporated into the model, and the data that provided the new feedback will be disposed of again. This might be done using different clustering techniques, or combinations of them, as in the sketch below.
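One way to approximate this learn-then-dispose loop (a sketch under the assumption that the behavior model is a set of cluster centroids, not this work’s final algorithm) is incremental clustering such as scikit-learn’s MiniBatchKMeans, whose partial_fit updates the centroids from a new batch without needing the batches already seen:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, random_state=0)

def ingest(batch):
    """Fold a fresh batch of sensor features into the model, then discard it."""
    model.partial_fit(batch)      # centroids move toward the new data
    del batch                     # raw readings are never stored

# Simulate several collection sessions: only the centroids survive.
rng = np.random.default_rng(0)
for _ in range(10):
    ingest(rng.normal(size=(100, 6)))
print(model.cluster_centers_)     # the behavior model kept on the device
```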

Goals

Main goal

Devising a weighted mathematical model over data gathered from mobile devices’ sensors, using clustering techniques, and disposing of the data that feeds the model.

Secondary goals

  1. Choosing two planes to work with (i.e., the most suitable sensors).
  2. Developing an Android application to access system sensor data and store the gathered information in a lightweight database.
  3. Developing a web API to collect the data and the resulting weighted mathematical models for each testing device.
  4. Testing different clustering techniques to come up with the mass center for each category in the cloud.
  5. Obtaining a lightweight algorithm capable of running on mobile devices to build the weighted mathematical model.
  6. Saving into a lightweight database the results of each data reading and how close it is to the weighted mathematical model.
  7. Improving the Android application to reduce battery consumption when reading and saving sensor data.

Methodology

This work proposes the construction of a weighted mathematical model of a behavior pattern, built upon data gathered from mobile devices’ sensors. The weighted model will not need the raw data, allowing its disposal; thus, privacy will be ensured. Once the model has access to new sets of data, it will use them to modify itself (a re-learning process), keeping isolated data objects outside the model and increasing accuracy. Again, once the data has fed the model, it will be disposed of, keeping the model as up to date as possible while assuring the user that their information hasn’t been stored.

The strategy will use clustering techniques to isolate groups of data for each sensor in the mobile device. Each group will yield a kind of mass center for the data in local groups, with a weight assigned based on the characteristics of each sensor (see Figure 1; a minimal sketch follows the figure).

figure_1
Figure 1 - Clustering data by agglomeration.
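A mass center here can be as simple as a weighted mean of the points assigned to a group; a minimal sketch, assuming the weights reflect each reading’s (or sensor’s) reliability:

```python
import numpy as np

def mass_center(points, weights):
    """Weighted mass center of one local group of sensor readings."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * points).sum(axis=0) / weights.sum()

# Example: readings from one sensor plane, weighted by confidence.
pts = [[1.0, 2.0], [1.2, 1.8], [5.0, 5.0]]
w = [1.0, 1.0, 0.2]               # the outlier contributes little
print(mass_center(pts, w))
```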

The agglomeration algorithm will then group the influential smaller clusters into a single big one whose center becomes the final mass center (see Figure 2).

After this, the algorithm will dispose of the previously gathered information. Once new information arrives, a whole new cluster will be calculated and merged with the existing one; repeating this process over and over consolidates a stronger mass center, as in the merge sketch after Figure 2.

figure_2
Figure 2 - Final cluster group with defined mass center.
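One standard way to fold a new batch’s center into the existing one without keeping old data is a running weighted mean, under the assumption that the model stores each center together with the number of points it summarizes:

```python
def merge_center(c_old, n_old, c_batch, n_batch):
    """Running weighted mean: fold a new batch's mass center into the model.

    Only the center and its point count survive; the raw points are disposed of.
    """
    n_new = n_old + n_batch
    c_new = (n_old * c_old + n_batch * c_batch) / n_new
    return c_new, n_new
```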

Each group will be considered a plane by itself. Connecting the mass centers of the planes will draw a curve (see Figure 3). This curve can then be projected onto two planes, and the resulting projected curves will be considered the mathematical model.

figure_3
Figure 3 - Curve generated upon mass centers for each plane.

In a first attempt, this work contemplates connecting only two planes, thus obtaining a linear model curve. However, the more sensors that can be added, the better: with at least three planes, a smooth curve can be drawn (using Bézier curves or similar), as sketched below.
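A sketch of that last step (assuming the per-plane mass centers are already computed; the control points below are illustrative values):

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, steps=50):
    """Sample a quadratic Bezier curve with control points p0, p1, p2."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

# Mass centers from three sensor planes (illustrative values).
c0 = np.array([0.0, 0.0, 1.0])
c1 = np.array([1.0, 2.0, 0.5])
c2 = np.array([2.0, 1.0, 0.0])
curve = quadratic_bezier(c0, c1, c2)

# Projections of the model curve onto two coordinate planes.
xy_projection = curve[:, [0, 1]]
xz_projection = curve[:, [0, 2]]
```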


  1. Oxford English Dictionary, “identity”. https://en.oxforddictionaries.com/definition/identity

  2. Rowe. https://oro.open.ac.uk/26689/

  3. S. Wilson et al. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3711152

  4. Bertino et al. https://www.semanticscholar.org/paper/Privacy-preserving-identity-attribute-verification-Steuer-Fernando/b22f12cee32487c2c71678cb504d4c319571ddb6

  5. Dunphy et al. https://www.semanticscholar.org/paper/Decentralizing-Digital-Identity%3A-Open-Challenges-Dunphy-Garratt/2dfd2470deeb680401b5d594a6e70e76dbc13b6c

  6. Windley et al. https://www.semanticscholar.org/paper/Decentralizing-Digital-Identity%3A-Open-Challenges-Dunphy-Garratt/2dfd2470deeb680401b5d594a6e70e76dbc13b6c

  7. Tobin et al., 2016. Missing link.

  8. Allen, C., 2016. https://www.semanticscholar.org/paper/The-Path-to-Self-Sovereign-Identity-Allen/03a396becd6c730c6142204e9429ce4503649bf7

  9. Khovratovich. https://www.semanticscholar.org/paper/Sovrin-%3A-digital-identities-in-the-blockchain-era-Khovratovich/9c4c7d60cd883e0bf85dd9eaccc8ed49b481ba77

  10. Paul A. Grassi et al., 2017. https://pages.nist.gov/800-63-3/sp800-63-3.html

  11. Yang et al., 2018. Missing link.

  12. Krechowicz et al., 2018. https://www.researchgate.net/publication/329958834_Hierarchical_Clustering_in_Scalable_Distributed_Two-Layer_Datastore_for_Big_Data_as_a_Service

  13. Wang et al., 2018. https://onlinelibrary.wiley.com/doi/10.1002/sam.11369

  14. Fan et al., 2018. Missing link.

  15. Bruce et al., 1986. https://bpspsychub.onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8295.1986.tb02199.x