RFC: Changes in DTrace to allow for distributed operation

Domagoj Stolfa domagoj.stolfa at gmail.com
Fri Dec 30 17:37:54 UTC 2016


Hello,

I have been working on extending DTrace to allow for a natural way of tracing in
a distributed environment. This would consist of being able to trace events on
different virtual machines, remote servers we have access to, cluster nodes and
so on. I will summarize the changes I have made and those I have considered
making, outlining the flaws and merits of each design tradeoff, in hopes of
getting feedback from others interested in distributed tracing.

The following abbreviations will be used:
 instance -> Operating system instance, running on a VM or bare metal.
 UUIDv1 -> Universally unique identifier version 1 as per RFC4122
 UUIDv5 -> Universally unique identifier version 5 as per RFC4122
 host -> the DTrace instance running on the machine that issued the DTrace
         script.
 DDAG -> Distributed directed acyclic graph
 

Starting off with an added struct in the kernel as a part of the DTrace
framework:

typedef struct dtrace_instance {
	char *dtis_name;
	struct dtrace_provider *dtis_provhead;
	struct dtrace_instance *dtis_next;
	struct dtrace_instance *dtis_prev;
} dtrace_instance_t;

where:
 dtis_name -> instance name
 dtis_provhead -> first provider in the instance
 dtis_next, dtis_prev -> doubly linked list nodes

- Each instance is identified by its name, which implies that once an instance
  with a given name is created, all other instances with that name will be
  identified equally on the host.
- Each new instance is added at the start of the list and becomes the new list
  head (see the sketch below).
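
To make the intended list manipulation concrete, here is a minimal,
userland-style sketch of creating an instance and inserting it at the head of
the list, using the dtrace_instance_t defined above. dtrace_instances and
dtrace_instance_create() are hypothetical names used purely for illustration;
the in-kernel version would use the framework's own allocator and locking.

#include <stdlib.h>
#include <string.h>

static dtrace_instance_t *dtrace_instances;	/* hypothetical global list head */

static dtrace_instance_t *
dtrace_instance_create(const char *name)
{
	dtrace_instance_t *inst;

	inst = calloc(1, sizeof(*inst));
	if (inst == NULL)
		return (NULL);
	inst->dtis_name = strdup(name);
	inst->dtis_provhead = NULL;

	/* The new instance is added at the start of the list. */
	inst->dtis_next = dtrace_instances;
	inst->dtis_prev = NULL;
	if (dtrace_instances != NULL)
		dtrace_instances->dtis_prev = inst;
	dtrace_instances = inst;

	return (inst);
}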

Merits:

- Identifying instances by their name allows for an easy transition between the
  framework and the scripts one would be writing.
- There is no redundancy in the list, which means less memory is used and fewer
  indirections are needed when traversing the list and looking up probes in the
  hash in order to identify which instance they belong to.

Flaws:
- This does not identify the instance that fired the probe in a unique way. In
  order to get this information the provider needs to be known (however, this is
  known from the dtrace_probe struct). The problem with this approach comes when
  we want to send the appropriate information one level up (towards the host).
  What needs to be sent is the probe ID, which then needs to be mapped to the
  appropriate ID on the host.
- Using just the instance name is not enough to identify which instance the
  provider/probe belongs to.

Possible resolution:
- A probe ID could be sent over to the host, with the DTrace framework changed
  so that the dtrace_probes array is no longer kept globally. Instead, it would
  be kept in the dtrace_instance struct. This would make it easy to identify the
  instance where the probe needs to be fired and would eliminate the need for
  the additional hash table.
- In order to be able to identify the instance that the provider belongs to, a
  UUID could be kept in the way explained further below. Additionally, the
  dtpv_next pointer could be used differently, so that it is no longer a single
  global list of providers, but a list of the providers in an instance. This
  could be accomplished by keeping a list of the providers of each instance in
  the dtrace_instance struct, or alternatively, by changing the semantics of the
  provider list so that it can easily be identified which providers belong to
  which instance (see the sketch below).
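
As an illustration of this resolution, the restructured dtrace_instance could
look roughly like the following. The dtis_uuid, dtis_probes and dtis_nprobes
fields are hypothetical names (struct uuid as in <sys/uuid.h>), not an actual
patch:

typedef struct dtrace_instance {
	char *dtis_name;			/* instance name */
	struct uuid dtis_uuid;			/* namespace-local UUID (see below) */
	struct dtrace_provider *dtis_provhead;	/* providers in this instance */
	struct dtrace_probe **dtis_probes;	/* per-instance probe array */
	int dtis_nprobes;			/* size of dtis_probes */
	struct dtrace_instance *dtis_next;
	struct dtrace_instance *dtis_prev;
} dtrace_instance_t;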

Another thing that needs to be changed is the way providers are identified. In a
distributed setting, it is not sufficient to identify a provider by its memory
address, which is what DTrace currently does. Instead, this can be done through
the combined use of UUIDv1 and UUIDv5.

- Each provider would have a corresponding UUID assigned to it, starting at the
  endpoint. The endpoint would advertise its namespace-local UUID (a UUIDv1 in
  this case) one level up. That instance would then generate a namespace-local
  UUID for the providers that originate from the instance that has just
  advertised its UUID. The UUID in this case would be a UUIDv5, combining the
  UUIDv1 generated in the endpoint with the name of the instance. The UUIDv5
  generated on the node would be kept as a namespace-local UUID on each provider
  that originated from the endpoint. This would then be advertised one more
  level up, again generating a UUIDv5. Using this, two DDAGs would be built
  implicitly. This can be demonstrated on the following topology:

   H ---+--- P1 --- VM{0...n} --- VM{0...n}{0...m}
        |
        +--- P2 --- VM{0...n} --- VM{0...n}{0...m}
        |
        ...
        |
        +--- Pk --- VM{0...n} --- VM{0...n}{0...m}

where P{1}, ..., P{k} are bare-metal machines, VM{0}, ..., VM{n} are top-level
virtual machines, and VM{i}{0}, ..., VM{i}{m} are the nested virtual machines in
the i-th top-level virtual machine.

The nested virtual machines, VM{i}{j}, would generate their own UUIDv1 for each
of their providers. This is guaranteed to be unique due to the fact that DTrace
locks every time it creates a new provider.

Following that, each of the providers from VM{i}{j} would be advertised to its
corresponding virtualization host, VM{i}. VM{i} would then generate a UUIDv5 for
each of the providers that were advertised from VM{i}{j}. The namespace name
that could be used is the name of the VM. This guarantees the uniqueness of each
UUIDv5 generated on VM{i}.
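
As a sketch of that derivation step, the UUIDv5 could be computed as per
RFC 4122: hash the UUID received from the level below (used as the namespace)
together with the advertising instance's name, truncate to 128 bits, and set
the version and variant bits. uuidv5_derive() is a hypothetical helper name,
and OpenSSL's SHA-1 is only used here for illustration; the kernel would use
whatever SHA-1 implementation is available.

#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

struct uuid128 {
	uint8_t bytes[16];	/* UUID in network byte order */
};

static void
uuidv5_derive(const struct uuid128 *ns, const char *name, struct uuid128 *out)
{
	SHA_CTX ctx;
	uint8_t digest[SHA_DIGEST_LENGTH];

	/* SHA-1 over the namespace UUID followed by the instance name. */
	SHA1_Init(&ctx);
	SHA1_Update(&ctx, ns->bytes, sizeof(ns->bytes));
	SHA1_Update(&ctx, name, strlen(name));
	SHA1_Final(digest, &ctx);

	/* Truncate the 160-bit digest to 128 bits. */
	memcpy(out->bytes, digest, sizeof(out->bytes));

	/* Set the version (5) and the RFC 4122 variant bits. */
	out->bytes[6] = (out->bytes[6] & 0x0f) | 0x50;
	out->bytes[8] = (out->bytes[8] & 0x3f) | 0x80;
}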

Furthermore, each of the VMs, VM{i}, would then advertise its providers
(including the providers that were advertised from the nested VMs, VM{i}{j}) to
P{x}. P{x} would generate the UUIDv5 in the same fashion and finally advertise
to H, which would then have all the providers from the different machines. The
difference when P{x} advertises to H is that the VM name could not be used,
because in this case P{x} is a bare-metal machine connected over the network to
H and to which H has access. One could use the public IP address (assuming no
anycast), the hostname and/or the port here.

In order to be able to identify these different machines, two UUIDs would need
to be stored in the dtrace_provider struct: the namespace-local UUID generated
on the host machine and the provider UUID that was generated on the machine that
advertised the provider, so that the graph can then be traversed (see the sketch
below).
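
For illustration only, the two fields added to struct dtrace_provider might
look like this (field names hypothetical, struct uuid as in <sys/uuid.h>):

	struct uuid dtpv_uuid;		/* namespace-local UUID generated on this machine */
	struct uuid dtpv_origin_uuid;	/* UUID advertised by the machine the provider came from */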

This would form a DDAG in the direction of tracing information flow from the
perspective of VM{i}{j}. That means that H would get information from VM{i}{j},
but there should be no way for VM{i}{j} to get any information from H in terms
of data that is local to H. H could identify exactly which instance has fired
the probe.

Another DDAG would be formed in the opposite direction, which would be used to
instruct other instances what to do. These actions could be DTrace destructive
actions, requests for the identification of a certain machine and similar
things. It is important that this is indeed a DDAG, as there should be no
possibility for such a request to circle back around to the host.

Additionally, in case of conflicts, UUID pocketing could be employed, simply
storing the identifying information in that form.

This approach requires restructuring the DTrace Provider-to-Framework API.
Namely, there needs to be a way to tell DTrace what instance is being
registered, what instance a probe is firing in, and a way to index them. This
can be made backwards-compatible. Consider the following example of ensuring
that no changes need to be made to the existing providers for correct operation
of DTrace:

dtrace_register() becomes dtrace_distributed_register(), where the former is
implemented in terms of the latter by simply passing in the instance as "host".
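
A minimal sketch of that wrapper, assuming the current in-kernel
dtrace_register() signature and a hypothetical dtrace_distributed_register()
that additionally takes the instance name:

int
dtrace_register(const char *name, const dtrace_pattr_t *pap, uint32_t priv,
    cred_t *cr, const dtrace_pops_t *pops, void *arg, dtrace_provider_id_t *idp)
{
	/* Existing providers keep working: register under the "host" instance. */
	return (dtrace_distributed_register("host", name, pap, priv, cr, pops,
	    arg, idp));
}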

Merits:

- Allows for a concise way of storing the identifying information on the host,
  allowing DTrace operations such as dtrace_register() and dtrace_probe() to
  operate in a similar fashion to the way they do now, with instance-awareness
  included. These operations could be implemented very efficiently.
- Easily scalable to an arbitrary number of nodes.

Flaws:

- The instances need to be trusted. There is room for malicious operation of
  these instances in the proposed approach if the deployment is arbitrary.
- While the existing DTrace operations can still be performed efficiently, the
  added instructions in these operations accumulate, resulting in a larger probe
  effect. This might prove problematic for some critical tasks and adds
  complexity to DTrace.

Possible resolution:

- For virtual machines, virtual machine introspection (VMI) could be employed.
  This could help verify whether or not the virtual machines are operating in a
  non-malicious manner.


Many of these things are subject to change. This approach has mainly evolved
from the goal of tracing virtual machines with DTrace through bhyve. The details
of how the interoperability between the DTrace instances would be implemented
have been intentionally left out, as they are outside the scope of this RFC
email (though I am more than willing to provide the information on the virtual
machine side should it be needed).

-- 
Best regards,
Domagoj Stolfa.