Filed under Android, Django, Machine Learning, Purple Robot, RapidMiner & RapidAnalytics.

While Purple Robot’s main features are its data collection mechanisms and embedded scripting environment, we’ve been working hard to integrate machine learners. Being able to execute learned models on the same device that is collecting data is enormously powerful and allows us to build functionality that takes specific actions when a learner predicts something useful (e.g. “Your mood seems to be unusually poor at the moment – would you like to play Angry Birds to take a break?”) or to help us collect a fuller dataset to improve our models of our user (e.g. “The confidence in predicting your location is low because your latitude and longitude fall outside the bounds of your previously-seen area. Where are you?”).

While implementing robust modeling on the mobile device opens up many interesting possibilities, limitations of typical mobile devices constrain our opportunity. On a technical level, these limitations include:

  • Battery power and lifespan
  • Computational processing power
  • Limited memory
  • Limited & expensive network access

To create a successful mobile experience, we have to weigh the impact of the systems we are creating against how the user expects their device to behave. For example, we can’t be constantly generating models because that would drain the phone’s power too quickly, and the advantages of the mobile platform are lost when the user has to keep the device tethered to a charger to keep it functional. We can’t go too crazy with I/O or memory usage because this will impact the responsiveness of other software running on the system. We also can’t use a cellular network in the same way that we might use a broadband connection – mobile users have much smaller data allocations that are orders of magnitude more expensive.

Given the opportunity offered by machine learning and data mining technologies, we’ve been exploring different approaches to try and capture the best of both worlds. In Purple Robot, we have already addressed some of these issues on the data collection front (such as our store-and-forward data collection & transmission architecture) and some of our approaches mirror what’s worked for us in similar contexts. The remainder of the post will outline how we’re adding learner functionality to Purple Robot.

Training the models

The fundamental and inescapable truth that complicates our life is that Purple Robot is capable of capturing a volume of information in real time that exceeds our ability to analyze it in any time- or space-effective way on the mobile phone. Hardware sensors can be configured to collect samples at rates exceeding 100 Hz, and software-based probes typically collect data at one-, five- or ten-minute intervals. Running a typical configuration, a mobile device can collect 10 to 15 megabytes of data in an hour. Given that we typically wish to build models from many hours’ worth of data, the overall memory footprint for the dataset in question can consume hundreds of megabytes (or more) of RAM in a typical scenario.

On the software side of things, mobile operating systems are quite conservative in the amount of memory that they will allow user apps to consume before forcefully shutting them down. The maximum size of the heap that Android will allocate varies by device, but it’s typically between 16 and 48 MB. Once we load the resources needed to run the app and any libraries we need for a proper analysis, the remaining memory is simply insufficient for most training algorithm implementations.
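To make the mismatch concrete, here is a quick back-of-the-envelope calculation using the figures above. The 24-hour training window is an illustrative assumption, and the numbers ignore any in-memory overhead from parsing the data or loading libraries.

```python
# Figures from the post; the 24-hour window is an illustrative assumption.
MB_PER_HOUR = (10, 15)    # data collected per hour in a typical configuration
HEAP_LIMIT_MB = (16, 48)  # typical per-app heap ceiling on Android


def dataset_size_mb(hours, mb_per_hour):
    """Raw size of the collected data, ignoring in-memory overhead."""
    return hours * mb_per_hour


def hours_until_heap_full(heap_mb, mb_per_hour):
    """Hours of collection that would fill the heap if nothing else used it."""
    return heap_mb / mb_per_hour


if __name__ == "__main__":
    low = dataset_size_mb(24, MB_PER_HOUR[0])
    high = dataset_size_mb(24, MB_PER_HOUR[1])
    print(f"24 hours of data: {low} to {high} MB")

    worst = hours_until_heap_full(HEAP_LIMIT_MB[0], MB_PER_HOUR[1])
    best = hours_until_heap_full(HEAP_LIMIT_MB[1], MB_PER_HOUR[0])
    print(f"Heap filled after {worst:.1f} to {best:.1f} hours of raw data")
```

Under these assumptions, a single day of collection comes to 240 to 360 MB, while even an otherwise empty heap would be filled by roughly one to five hours of raw data.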

Consequently, we’ve adopted an architecture that forgoes model training on the phone, delegating that responsibility to a server API that can retrieve the data collected from our storage systems and train models on that data on traditional desktop/server hardware that supports much more physical memory and related software infrastructure like disk-based swap spaces.

In our own implementations, we’ve adopted RapidAnalytics as our default learning engine because it provides a user-friendly interface that we can use to create workflows that process our data, train & evaluate models, and package the models (with assistance) in a format that can be expanded, interpreted and executed on the mobile device. (More on this below.) The RapidAnalytics server product provides a simple route for taking customized workflows authored in RapidMiner and exposing that functionality via a web API.

Surrounding the RapidAnalytics component is a Django web application that implements the necessary functionality that RapidAnalytics does not include. The Django application provides the following services:

  1. Retrieving and packaging data from our storage system (Postgres) into a format ingestible by RapidAnalytics (ARFF).
  2. Maintaining the batch scheduler that handles periodic tasks such as updating the cached ARFF file for a given participant & label. This batch system also automates the training and evaluation of models through RapidAnalytics.
  3. Providing a researcher-facing data dashboard that provides tools to assess the quality of the data being collected from the mobile devices.
  4. Providing the transmission channel for sending models to the mobile devices.
  5. Translating any proprietary data formats (RapidMiner or otherwise) into formats suitable for use on the mobile device.
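As a rough illustration of the first service, the sketch below renders a handful of readings as ARFF text. It is not the Django application’s actual code: the function name `to_arff`, the attribute names, and the sample rows are all hypothetical, and it assumes attribute names and values contain no spaces or commas (which would otherwise need quoting). Missing readings are written as `?`, the ARFF marker for a missing value.

```python
def to_arff(relation, attributes, rows):
    """Render rows (a list of dicts) as ARFF text.

    attributes is a list of (name, type) pairs, where type is either the
    string "numeric" or a list of nominal values. A value that is absent
    or None is written as "?", ARFF's marker for missing data.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        values = []
        for name, _ in attributes:
            value = row.get(name)
            values.append("?" if value is None else str(value))
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical features and label, for illustration only.
    attributes = [("battery_level", "numeric"), ("mood", ["good", "poor"])]
    rows = [
        {"battery_level": 0.8, "mood": "good"},
        {"mood": "poor"},  # battery probe was off for this sample
    ]
    print(to_arff("purple_robot_sample", attributes, rows))
```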

The current implementation status of the server component is that we are successfully training models (decision trees) and caching the results on an individual basis. We are currently improving that process to make it more robust to anomalies like missing data when using more sensitive learners with implementations that don’t cope well with the data that we’re collecting in the messy real world.


Executing the models

Once we’ve generated models using our server infrastructure, the mobile device fetches the trained models from the server and replaces any existing model for a given label or concept with the newest one. Note that the mobile device has neither a sufficient history nor the metadata to determine whether one model is better than another, so the server infrastructure is solely responsible for guaranteeing that it’s making the best model available to the mobile device.

Once the device has the model, we execute it locally either using our embedded scripting environments (JavaScript or Scheme) or with the assistance of an existing native Java library. In the case of decision trees, we take the model produced by RapidAnalytics and generate a JavaScript if/then tree that implements the decision tree model. In the case of support vector machines, we take a textual representation of the support vectors and generate a native evaluator using the LibSVM library.
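To give a feel for the decision tree translation, here is a minimal sketch of a server-side generator that turns a tree into a JavaScript if/then function. The nested-dictionary tree format, the function names, and the `light_lux` feature are assumptions made for illustration; the actual RapidAnalytics output and our translator differ in detail, and this sketch handles only numeric threshold splits with no special treatment of missing values.

```python
import json


def tree_to_js(node, indent=1):
    """Recursively emit nested if/else statements for one tree node."""
    pad = "  " * indent
    if "label" in node:  # leaf node: return the predicted label
        return f"{pad}return {json.dumps(node['label'])};\n"
    feature = json.dumps(node["feature"])
    return (
        f"{pad}if (features[{feature}] <= {node['threshold']}) {{\n"
        + tree_to_js(node["left"], indent + 1)
        + f"{pad}}} else {{\n"
        + tree_to_js(node["right"], indent + 1)
        + f"{pad}}}\n"
    )


def model_to_js(tree, name="predict"):
    """Wrap the generated if/else tree in a JavaScript function."""
    return f"function {name}(features) {{\n" + tree_to_js(tree) + "}\n"


def evaluate(node, features):
    """Reference evaluator in Python, handy for checking the generated logic."""
    while "label" not in node:
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


if __name__ == "__main__":
    # Hypothetical one-split tree, for illustration only.
    tree = {
        "feature": "light_lux",
        "threshold": 50,
        "left": {"label": "indoors"},
        "right": {"label": "outdoors"},
    }
    print(model_to_js(tree))
```

The generated source is plain JavaScript with no dependencies, so in principle it can be handed to the embedded scripting environment as-is.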

Offloading the training of the learners to the server infrastructure, while keeping the real-time evaluation of the data on the local device, allows us to reap the benefits of computationally-expensive training algorithms while retaining the ability to remain responsive and accessible (in periods where network connectivity may be spotty) on the device itself. The two main drawbacks to this approach are that models on the mobile device may become stale if the device is unable to retrieve updated models from the server for any reason, and that the choice of learner algorithms is not entirely open-ended since we must still be cognizant of how device limitations can constrain model execution.

The most salient example of these constraints that we’ve encountered is dealing with missing data. During the lifetime of a mobile device, particular sensors may be deactivated for a variety of reasons, including limited predictive utility, selective shutdowns by the system to conserve battery life, or the user changing the parameters under which the sensor operates. Consequently, the feature set that we provide with the labels is quite likely to be dynamic in structure. New feature values may be introduced in later data sets, and values may later be removed.

For models that are robust against missing values (such as C4.5 decision trees), this isn’t a major issue. However, for other algorithms (such as RapidMiner’s SVM implementation), missing data can prevent the classifier from producing results, so imputing the missing values becomes an important part of the model execution. Since the mobile device does not have the storage capacity to keep a full history of everything it ever sensed, this lack of history can preclude techniques that depend on the historical distribution of values to compute a replacement for the missing attribute. Consequently, on the server side, algorithms must be chosen and configured with these execution limitations in mind if the models used by the device are to have the same performance characteristics as the models trained on the server.
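One way to work within this constraint, sketched below, is to have the server compute a single replacement value per feature at training time and ship those defaults alongside the model, so the device never needs the history itself. This is an illustrative assumption rather than a description of what Purple Robot currently does; the function names and sample features are hypothetical.

```python
from collections import Counter
from statistics import mean


def compute_defaults(rows, features):
    """Server side: derive one replacement value per feature from training data.

    Numeric features use the mean of the observed values; anything else
    uses the most frequently observed value.
    """
    defaults = {}
    for name in features:
        seen = [row[name] for row in rows if row.get(name) is not None]
        if not seen:
            continue  # never observed, so there is nothing to impute from
        if all(isinstance(value, (int, float)) for value in seen):
            defaults[name] = mean(seen)
        else:
            defaults[name] = Counter(seen).most_common(1)[0][0]
    return defaults


def impute(sample, defaults):
    """Device side: fill gaps in one reading using the shipped defaults."""
    filled = dict(defaults)
    filled.update({k: v for k, v in sample.items() if v is not None})
    return filled


if __name__ == "__main__":
    # Hypothetical training rows, for illustration only.
    training = [
        {"lux": 10, "wifi": "home"},
        {"lux": 30, "wifi": "home"},
        {"lux": None, "wifi": "work"},
    ]
    defaults = compute_defaults(training, ["lux", "wifi"])
    print(impute({"lux": None, "wifi": "work"}, defaults))
```

Because the same defaults are applied during training and on the device, the classifier sees consistently filled inputs in both places, at the cost of ignoring any drift in the feature distributions after the model was trained.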

Current status

On the mobile device side of the software development, we’ve run some small feasibility experiments with deserializing and executing decision tree models and have been successful on that front. While LibSVM has been included with the Purple Robot distribution for several weeks, we are still working on resolving the issues on the server that prevent it from reliably producing a trained SVM in the presence of missing and dirty data. Once we are successful on that front, we already have a process in place for converting the RapidAnalytics SVM output into a format suitable for deserialization (via LibSVM) on the device.

While we’re still in the process of assembling this machine-learning infrastructure, we’re excited to begin applying it in a productive user-facing manner on the mobile devices. As I mentioned in the introduction, I believe that marrying the real-time execution of models with our existing trigger framework will allow us to create more personalized and responsive interventions and products than our current schedule-based systems. I’m also quite interested to see if we can employ the confidence estimates of our predictions to help us obtain more data where we need it and interrupt the user less often when we’re already producing reasonably confident predictions in contexts we’ve already observed repeatedly.