How to Make Robots Reliable


Written by James Edmondson, Senior Principal Scientist — Scalable Infrastructure.

The Origin

Back in the late 2000s, I was in grad school at Vanderbilt University attending an Advanced Operating Systems course under a professor named Larry Dowdy. Larry’s course was hard and fun, but at the time, I wondered why a course on Advanced Operating Systems really had to be hard. To me, distributed systems just made sense, and I felt that the problems many students struggled with (timing, control, and memory sharing across networked processes) could be overcome by providing better programming abstractions.

I was studying under my adviser Douglas Schmidt, of ACE+TAO/CORBA fame, and during the course I started writing a distributed operating system middleware that allowed me to quickly program any distributed algorithm we discussed in class, usually during the lecture. It wasn’t until the day we reviewed Dijkstra’s self-stabilizing processes that I thought I might be on to something with the technology I was working on.

While Larry drew out by hand the unstable states and how they converged into stable, repeating systems, I simply wrote an almost exact copy of the algorithm Dijkstra had written in his papers. I created a language called the Knowledge and Reasoning Language (KaRL), which processed knowledge from a networked database I had also developed, and I showed Larry all of the different paths the unstable systems took into stable ones. These two technologies, a real-time interpreted knowledge language and a distributed knowledge base, became the foundations for the Multi-Agent Distributed Adaptive Resource Allocation (MADARA) project, which I initially released on Google Code during my grad work at Vanderbilt between 2007 and 2009.
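For readers who have not seen self-stabilization before, here is a minimal Python sketch of Dijkstra’s K-state token ring, the classic example of the idea. It is not the KaRL program described above, and the function names (privileged, step, stabilize) are hypothetical names chosen for this illustration. Each machine in the ring holds a small integer, the ring may start in any arbitrary state, and a scheduler repeatedly lets one privileged machine move until exactly one privilege remains.

```python
import random


def privileged(states):
    """Return the indices of machines that currently hold a privilege."""
    n = len(states)
    holders = []
    for i in range(n):
        if i == 0:
            # Machine 0 is privileged when it matches its left neighbor,
            # which is the last machine in the ring.
            if states[0] == states[n - 1]:
                holders.append(0)
        elif states[i] != states[i - 1]:
            # Every other machine is privileged when it differs from its left neighbor.
            holders.append(i)
    return holders


def step(states, k, i):
    """Let privileged machine i make its move (mutates states in place)."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, rng, max_steps=10_000):
    """Run from an arbitrary state until exactly one privilege remains.

    Assumes k >= len(states). Returns (final_states, steps_taken).
    """
    states = list(states)
    for steps in range(max_steps):
        holders = privileged(states)
        if len(holders) == 1:
            return states, steps
        # An arbitrary scheduler: pick any one privileged machine to move.
        step(states, k, rng.choice(holders))
    raise RuntimeError("did not stabilize within max_steps")


if __name__ == "__main__":
    rng = random.Random()
    k = 6
    start = [rng.randrange(k) for _ in range(5)]
    final, steps = stabilize(start, k, rng)
    print("start:", start)
    print("final:", final, "after", steps, "moves")
```

Once the ring reaches a state with a single privilege, every later move passes that privilege to the next machine, which is the stable, repeating behavior Larry was drawing on the board.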

Shortly after the course, which I audited a second time because I enjoyed distributed processes so much, I began to attend seminars at the Vanderbilt Institute for Software Integrated Systems, including two by a Turing Award winner named Joseph Sifakis, whose foundational work on software model checking, alongside that of Ed Clarke of Carnegie Mellon University, had helped bring about an age of reliable and mathematically provable distributed system theory. I asked Joseph a lot of questions, and for a very specific reason: I had already made a system for creating distributed processes and coordination, but I wanted to do more. I wanted to give people not only the ability to intuitively create distributed systems, but also to make distributed systems they could trust with their businesses, ideas, and even their lives.

About the same time that I took Larry’s course, I also took some of my core artificial intelligence grad courses. One of the more interesting ones was Distributed Artificial Intelligence, taught by Julie Adams. I ended up doing a lot of outside reading on swarm intelligence, machine learning, and AI, and Julie lent me a couple of books to pore over to help me on my quest for distributed AI. I began using MADARA and KaRL to create small swarm applications: evolutionary programs that evolved intelligence over time from seed programs, usually in a style that mimicked the self-stabilizing systems from my operating systems course. These became some of the core concepts I would eventually deploy in the Group Autonomy for Mobile Systems (GAMS) project, but that would come years later, after many lessons learned while interacting with students and researchers seeking to program their own distributed AI with robots and virtual agents.


When did MADARA and GAMS start focusing on reliability?

Shortly after I graduated, I managed to land a research job at Carnegie Mellon University. Within my first year, I met one of Ed Clarke’s former students, Sagar Chaki, who was already pushing the state of the art in formal methods, a field that focuses on reliability in software development through mathematical proofs and rigorous techniques. Sagar and I began talking about my visions of the future of swarm intelligence, multi-agent systems, and distributed processes, and how we might merge the visions of Ed Clarke and Joseph Sifakis with the push toward distributed artificial intelligence and systems.

We applied for some research funding on a project called Model Checking for Distributed Applications (MCDA) and a follow-up project called Distributed Adaptive Real-Time (DART), and we made progress in these areas. We released our work as open source. We worked with researchers at the Air Force Research Lab and other government agencies, and we talked with thought leaders in government and academia about the kind of future we saw in front of us: distributed artificial intelligence not only in clusters but also in robots all around us.

We live in an age of robotics and autonomy. The decision to push MADARA and GAMS toward reliability as well as scalability and capability was not so much a choice as an acknowledgment of what we perceived as an inevitability and a requirement for our future. We will need to be able to trust autonomous systems. We will need to be able to prove their correctness, long before we deploy them. We will need to be able to quickly address any errors. And that constant focus on reliability, provability, and control is what will drive all future development of MADARA and GAMS.


What can you do with MADARA and GAMS? Why might we use them?

MADARA is a scalable, predictable, and controlled way of distributing knowledge and understanding among agents. It was primarily developed to support thousands, and maybe tens of thousands, of robotic systems that have to operate and collaborate at the edge, in places where communication is going to be disrupted. Think of the edge as the unexplored territory that ancient explorers combed through to see the world: places where messaging is difficult and unreliable. This design philosophy and concept of operations forces us to think not only about how to build highly scalable systems that can communicate across large-scale deployments in perfect network conditions, but also about how to address robustness in denial-of-service situations and gradients of quality of service in networking, computation, and control.

The MADARA knowledge base and network transport layer were built to be not only extensible but also very predictable, wherever possible. We do not guarantee network reliability; that is, we do not guarantee that every message or piece of knowledge you generate will be synchronized throughout the distributed system all the time. However, we do provide mechanisms to understand, analyze, and even formally guarantee consistency of knowledge and how the system will behave once communication and computation are established and performed. We also provide mechanisms to minimize asynchrony and enforce gradients of reliability when you can relax temporal or logical constraints on the system. Within reasonable boundaries of the scope and mission of the robotics platforms and the needs of users for correctness, we can adjust the reliability, safety, and predictability of distributed agents and processes, or even dictate the execution of threads within a single-host, multi-threaded robot. This gradient of correctness and tolerance for variability can be useful when dealing with probabilistic systems like machine learning.
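To make the idea of consistent knowledge over an unreliable network more concrete, here is a small Python sketch of one generic technique: a last-writer-wins merge ordered by logical clocks. This is an assumption made purely for illustration. It is not MADARA’s API and is not presented as how MADARA is implemented, and the Agent class and its methods are hypothetical names. The point it shows is that agents can lose individual messages and still agree on every value once any later exchange gets through.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent holding a tiny replicated key-value store."""
    name: str
    clock: int = 0
    # Maps key -> (logical clock, originating agent name, value).
    store: dict = field(default_factory=dict)

    def set(self, key, value):
        """Write locally and return the update message to broadcast."""
        self.clock += 1
        self.store[key] = (self.clock, self.name, value)
        return {key: self.store[key]}

    def snapshot(self):
        """Return everything this agent knows, for a full resend."""
        return dict(self.store)

    def receive(self, update):
        """Merge an incoming update, keeping the newest entry per key."""
        for key, (clock, origin, value) in update.items():
            # Advance the local clock so later local writes order after this one.
            self.clock = max(self.clock, clock)
            current = self.store.get(key)
            # Compare (clock, origin) so that clock ties resolve the same way everywhere.
            if current is None or (clock, origin) > (current[0], current[1]):
                self.store[key] = (clock, origin, value)

    def get(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[2]


if __name__ == "__main__":
    a, b = Agent("a"), Agent("b")
    a.set("x", 1)               # this message is lost on the network
    a.receive(b.set("x", 2))    # this one is delivered
    a.set("y", 5)               # lost again
    # Connectivity returns: one exchange of snapshots is enough to agree.
    b.receive(a.snapshot())
    a.receive(b.snapshot())
    print(a.store == b.store)
```

The trade-off in this sketch is that an overwritten value is simply discarded, which is acceptable for state such as a position estimate where only the latest value matters, and less so when every update must be seen.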

Ultimately, we’re aiming to provide both performance and rigorous reliability in our software suites because the future of trust in autonomous systems has to have both. You should be able to scale the systems you create with MADARA and GAMS and trust them to operate safely around you. It takes more than just a set of middleware to accomplish this, though. It takes a development process that prioritizes rigor, reliability and reproducibility above all else. At Shield AI, we’re trying to change the way that people think about robotic systems, including how to make them both trustworthy and safe.