Filed under Data Management, Machine Learning, Purple Robot.

Purple Robot (PR) is an excellent data-collection and device-side background application, with an impressive functionality set. But for any non-trivial use – and particularly for use in an institutional setting (such as that of CBITS) – it quickly becomes necessary to enable automated data-ingestion.

tl;dr

Purple Robot Importer and Purple Warehouse are a robust, scalable system for receiving sample data uploaded by Purple Robot (or any other client implementing its simple message interface), including high-frequency data like that collected by the accelerometer. The data are stored in individualized, per-user SQL databases, making the system fast and easy to query.

Longer Version

How to do this? Consider the following facts (guided partly by the “three Vs” framing of “big data” problems; even though we don’t have “big data” yet, we expect eventually to reach such a state):

  • The volume of data can be quite large, given sub-second sampling rates on several of the probes.
    • The Accelerometer Probe, for example, is capable of capturing the X, Y, and Z dimensions, timestamped at the nanosecond level, at rates upwards of 100 samples/sec (here “samples” is used in the general scientific sense; in the rest of this post these would be “sub-samples”, in Purple Warehouse terms). Other sub-second probes include the Gyroscope Probe.
    • Additionally, other probes sample frequently, perhaps every few seconds, such as the Pressure Probe and the Temperature Probe.
  • The velocity of the data output by the swarm of mobile devices sending to us will rise as mobile and wi-fi network upload speeds increase.
  • The variety (i.e. shape) of the data structures output by Purple Robot is considerable:
    • Common attributes exist across PR probes, but each probe inherently offers different dimensions from other probes.
    • Added to this, device manufacturers, motivated by the competitive advantage of greater functionality, continue to add new sensors – sensors that we may wish to capture.
    • Non-sensor uses also create varying data structures.
  • Conventional big-data tools (e.g. Hadoop and its ecosystem of Mahout, HBase, Hive, Impala, etc.) are too heavyweight and user-unfriendly for a small technical team on stereotypically tight university research budgets, whose customers are bright but not technically-focused researchers.
  • Near-real-time querying is desirable, given that our ability to perform behavioral interventions in a meaningful way with at-risk study participants may demand rapid response.
  • Researchers tend to use data-analysis tools like Matlab, RapidMiner, R, and Excel – none of which integrate with “big data” NoSQL databases like MongoDB, but all of which can integrate with SQL-based RDBMSs.
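To make the volume point above concrete, here is a back-of-the-envelope sizing for a single accelerometer stream; the 100 samples/sec rate comes from the Accelerometer Probe figures above, while the per-row byte count is purely an illustrative assumption:

```javascript
// Rough sizing for one accelerometer stream at 100 sub-samples/sec.
// Each sub-sample carries X, Y, Z values plus a nanosecond timestamp.
function subSamplesPerDay(ratePerSec) {
  return ratePerSec * 60 * 60 * 24;
}

// Assume ~40 bytes of payload per row (three doubles plus an 8-byte
// timestamp, plus some overhead) -- an illustrative figure, not measured.
function bytesPerDay(ratePerSec, bytesPerRow) {
  return subSamplesPerDay(ratePerSec) * bytesPerRow;
}

const rowsPerDay = subSamplesPerDay(100);    // 8,640,000 rows/day/device
const mbPerDay = bytesPerDay(100, 40) / 1e6; // ~345.6 MB/day/device
```

At that rate a single device generates millions of rows per day, which is why per-sample structural checks (discussed below) would be prohibitively expensive.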

Purple Robot Importer (PRI) and Purple Warehouse (PW) are custom in-house solutions to this problem.

These services have been fast, stable, and robust, serving multiple trials and external collaborators since November 2012, with only a couple of bugfixes (both very early on) and very few feature upgrades needed since their initial deployment.

Architecture

The data-flow architecture looks like this (borrowing a slide from a presentation given about this system at ISRII in May 2013):

PRI needed to be a high-speed, write-only API layer, so it was conceived, before any code was written, as a stateless, functional architecture hewing closely to the design intentions of Node.js applications. This makes PRI easy to scale: simply add additional PRI instances and round-robin them at the load-balancer level.

Alternatively or additionally, one may add more cores to the PRI host: both Metamorphoo and Trireme have been parallelized via the “cluster” module available for Node.js, taking full advantage of multicore systems by instantiating n_cores-1 worker processes managed by a master process.

Purple Robot Importer (PRI)

PRI is the system that receives sample data uploaded by PR, and stores it in PW.

PRI fundamentally consists of two web-service applications:

  1. Metamorphoo
  2. Trireme with the Dingo extension

Both applications provide a large number of HTTP(S) routes, most of which are not relevant to the PRI system.

Metamorphoo — so-named because its original purpose was data-transformation — exists as the application to which PR connects from a mobile device.

Trireme is a data-access layer (originally built as an API layer to a MongoDB instance) which houses Dingo as a module that Metamorphoo calls during sample-upload processing.

Dingo is essentially a web API for executing CRUD SQL statements against either a Postgres or MySQL instance (the latter a legacy carryover from iteration 1 of PRI).

Metamorphoo receives probe samples from PR in a JSON message format known and enforced on both PR and PRI (with integrity validation on both ends prior to any processing). It then converts these into a set of SQL statements that will create or update a user’s database, from scratch if necessary, to contain columns representing all of a sample’s top-level dimensions and their sub-samples. This works regardless of datatype, via datatype inference on the first observation. (In effect, this performs a purpose-limited subset of ORM functionality.) Samples and sub-samples are converted to SQL INSERT statements.
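The inference and INSERT-generation steps can be pictured like this; the JSON-to-Postgres type mapping and the identifier quoting shown here are illustrative assumptions, not PRI’s actual rules:

```javascript
// Infer a Postgres column type from the first observed value of a dimension.
function inferColumnType(value) {
  switch (typeof value) {
    case 'number':  return Number.isInteger(value) ? 'bigint'
                                                   : 'double precision';
    case 'boolean': return 'boolean';
    default:        return 'text'; // strings and anything else fall back here
  }
}

// Build a parameterized INSERT for one sample's top-level dimensions.
function buildInsert(table, sample) {
  const cols = Object.keys(sample);
  const placeholders = cols.map(function (_, i) { return '$' + (i + 1); });
  return {
    text: 'INSERT INTO "' + table + '" ("' + cols.join('", "') + '") ' +
          'VALUES (' + placeholders.join(', ') + ')',
    values: cols.map(function (c) { return sample[c]; })
  };
}
```

For example, `buildInsert('accelerometer', { ts: 1379612345, x: 0.01 })` yields `INSERT INTO "accelerometer" ("ts", "x") VALUES ($1, $2)` with the values carried separately.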

Each database existence- and structure-validation step, and its corresponding creation or modification step when insufficient structure exists, is performed step-by-step as SQL statements executed on PW via Dingo, until the DB contains the structure necessary to represent samples and sub-samples. Structural checks occur for each message sent by PR, but not for each sample or sub-sample being inserted – that would degrade performance far too much.
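The per-message structural check amounts to diffing the columns a table already has against the dimensions present in the incoming message, and emitting DDL for anything missing. A sketch of that diff, under the same assumed type-mapping caveats as above (the `inferType` callback is a hypothetical helper):

```javascript
// Given a table's existing columns and one incoming sample, emit
// ALTER TABLE statements for any dimensions the table cannot yet store.
function missingColumnDdl(table, existingCols, sample, inferType) {
  const existing = new Set(existingCols);
  return Object.keys(sample)
    .filter(function (col) { return !existing.has(col); })
    .map(function (col) {
      return 'ALTER TABLE "' + table + '" ADD COLUMN "' + col + '" ' +
             inferType(sample[col]);
    });
}
```

Running this once per message, rather than once per sub-sample, keeps the DDL overhead a small constant against the potentially thousands of INSERTs in a single upload.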

At this point, the set of sample-insertion statements is sent to Dingo for submission to and execution by PW.
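That hand-off can be pictured as a single HTTP request carrying the batched statements; the endpoint path and body fields below are hypothetical names invented for illustration, not Dingo’s actual interface:

```javascript
// Package a batch of SQL statements as one request body for Dingo.
// '/dingo/execute' and the field names are assumed, not Dingo's real API.
function dingoRequest(database, statements) {
  return {
    method: 'POST',
    path: '/dingo/execute',
    body: JSON.stringify({ database: database, statements: statements })
  };
}
```

Batching the whole message’s statements into one request keeps the Metamorphoo-to-Dingo chatter proportional to messages, not to sub-samples.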

Purple Warehouse (PW)

PW is a data warehouse built on PostgreSQL into which all PR data is uploaded by PRI. It sits at the heart of the entire system, taking all data inflows, and responding to all queries for data.

PW stores the collected sample and sub-sample data.

Each username maps to a single user’s database.

Each probe type (e.g. accelerometer, location, etc.) maps to a sample table in the user’s database.

Each sample maps to a row in its sample table.

Each sub-sample maps to a row in the sample table’s corresponding sub-sample table.
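The mapping above can be sketched as a trio of naming functions; the lowercase-and-underscore scheme and the `_subsamples` suffix are assumptions for illustration, not PW’s documented identifiers:

```javascript
// Map the PW hierarchy onto SQL identifiers: one database per user, one
// sample table per probe, and one companion sub-sample table per probe.
// The sanitization scheme here is assumed, not PW's actual rule.
function userDatabase(username) {
  return username.toLowerCase().replace(/[^a-z0-9]/g, '_');
}

function sampleTable(probeName) {
  return probeName.toLowerCase().replace(/[^a-z0-9]/g, '_');
}

function subSampleTable(probeName) {
  return sampleTable(probeName) + '_subsamples';
}
```

The practical payoff of this one-database-per-user layout is isolation: a researcher’s query over one participant never contends with, or accidentally reads, another participant’s data.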

As of Sept. 19, 2013, our main PW instance houses the roughly 490 GB of data we have collected via PRI since Nov. 2012, of which 330 GB has been stored since about May 18, 2013. Much of the total comes from activity-recognition researchers making heavy use of the accelerometer.

Conclusion

PRI and PW provide the foundation for widespread mobile-device sensor data-collection at CBITS. Whether for low-volume data such as survey responses or high-volume data such as accelerometer samples, the system has risen to the challenge of collecting data from hundreds of users.