FlowTracker

Track data flowing through Java programs, gain new understanding at a glimpse.

FlowTracker is a Java agent that tracks how a program reads, manipulates, and writes data. By watching a program run, it can show what file and network I/O happened. More importantly, it connects the program's inputs to its outputs, showing where each part of the output came from. This helps you understand what any Java program's output means and why the program wrote it.

This proof-of-concept explores what insights we get by looking at program behaviour from this perspective.

Demo

Spring PetClinic is a demo application for the Spring framework. To demonstrate FlowTracker's abilities, we let it observe PetClinic handling an HTTP request and generating an HTML page based on a template and data from a database. You can use this demo in your browser, without installing anything. Open the FlowTracker PetClinic demo, or watch the video below.

petclinic.mp4

You see the HTTP response that FlowTracker saw PetClinic send over the network. Click on a part of the contents of the HTTP response to see in the bottom view where that part came from. You can select another tracked origin/input or sink/output in the tree on the left (or bottom left button on mobile).

Exploring this HTTP response, we navigate through multiple layers of the software stack:

  • HTTP handling: FlowTracker shows what code produced what output. Click on "HTTP/1.1" or the HTTP headers. You see that this part of the response was generated by Apache Coyote (classes in the org.apache.coyote package), pointing you to where exactly each header came from.
  • Thymeleaf templates: FlowTracker shows how the input the program reads (the HTML templates) corresponds to the output. Click on an HTML tag name, like "html" or "head". You see the layout.html file, where this part of the HTML page comes from. If you click on layout.html, and then on the colorful + button at the bottom, then everything coming from that file will be marked in the same color. Scrolling down, you'll notice that part of the response comes from a different file, ownerDetails.html. Click on a < or > to see that those characters were written by the Thymeleaf templating library.
  • Database: The HTML page contains a table with information that comes from the database. Clicking on George in that table shows not only that the value came from the database. It goes further: FlowTracker traced it all the way back to the SQL script that inserted that value into the database in the first place.

In that demo, tracking all the way back to the SQL script works because PetClinic was using an in-memory database. The database content never left the JVM, so FlowTracker could keep track of it fully. When we run the same demo with a MySQL database, we track those values only as far as the database connection: we see the SQL query that was sent to produce them, and details of how the MySQL JDBC driver talks to the database. See FlowTracker PetClinic mysql demo. Notice that FlowTracker intercepts the decrypted contents sent over the SSL connection to the database.

This Spring PetClinic demo is just an example. FlowTracker does not depend on your application using any particular framework or library.

Another demo shows how, by watching the Java compiler, FlowTracker helps you understand the format of the generated class file and the bytecode in it: javac demo, video.

Usage

Warning: In its current state, FlowTracker is closer to a proof of concept than to production-ready software. It has proven itself to work well on a number of example programs, but it is not going to work well for everything; your mileage may vary. Also be aware that it adds a lot of overhead, making programs run much slower.

Download the FlowTracker agent jar from the GitHub releases page (flowtracker-*.jar under "Assets"). Add the agent to your java command line: -javaagent:path/to/flowtracker.jar. Disable some JVM optimizations that disrupt FlowTracker by also adding the output of java -jar flowtracker.jar jvmopts to the command line. By default, FlowTracker starts a web server on port 8011, so open http://localhost:8011/ in your browser.

For more detailed instructions, including configuration options, see USAGE.md.

How it works internally

Short version

FlowTracker is an instrumenting agent. The agent injects its code into class files (bytecode) when the JVM loads them. That code maintains a mapping of in-memory data to its origin, while the program reads, passes around, and writes data. The focus is on tracking textual and binary data (like Strings, char and byte arrays), not on numerical, structured or computed data.

This is achieved with a combination of:

  • Replacing some calls to JDK methods with calls to FlowTracker's version of those methods.
  • Injecting code into key places in the JDK, mostly to track input and output.
  • Dataflow analysis and deeper instrumentation within methods to track local variables and values on the stack.
  • Adding code before and after method invocations, and at the start and end of invoked methods, to track method arguments and return values using ThreadLocals.

Data model: Trackers

Core classes and concepts of FlowTracker's data model:

  • Tracker: holds information about a tracked object's content and source:
    • content: the data that passed through it, e.g. all bytes passed through an InputStream or OutputStream.
    • source: associates ranges of its content with their source ranges in other trackers. For example, for the bytes of a String, the source could point to a range in the tracker of the FileInputStream that the String was read from, telling us which file it came from and where exactly in that file.
  • TrackerRepository: holds a large global Map<Object, Tracker> that associates interesting objects with their tracker.
  • TrackerPoint: Pointer to a position in a tracker, representing a single primitive value being tracked, e.g. the source of one byte.
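To make these three concepts concrete, here is a minimal sketch of such a data model. It is not FlowTracker's actual code: the class names (`DataModelSketch`, `Part`, and so on) are hypothetical, only the source mapping is modelled (the content is left out), and the use of an identity-based map is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; none of these classes are FlowTracker's own.
class DataModelSketch {
  /** Range [index, index + length) of a tracker came from `source`, starting at sourceIndex. */
  static final class Part {
    final int index;
    final int length;
    final Tracker source;
    final int sourceIndex;

    Part(int index, int length, Tracker source, int sourceIndex) {
      this.index = index;
      this.length = length;
      this.source = source;
      this.sourceIndex = sourceIndex;
    }
  }

  /** Stand-in for a Tracker: remembers where ranges of its content came from. */
  static final class Tracker {
    final String description;
    final List<Part> parts = new ArrayList<>();

    Tracker(String description) {
      this.description = description;
    }

    void setSource(int index, int length, Tracker source, int sourceIndex) {
      parts.add(new Part(index, length, source, sourceIndex));
    }

    /** Returns the part covering the given index, or null if that position is untracked. */
    Part partAt(int index) {
      for (Part p : parts) {
        if (index >= p.index && index < p.index + p.length) {
          return p;
        }
      }
      return null;
    }
  }

  /** Stand-in for the repository. Assumption: keyed on object identity, not equals(). */
  static final class Repository {
    private final Map<Object, Tracker> trackers = new IdentityHashMap<>();

    synchronized Tracker getOrCreate(Object o, String description) {
      return trackers.computeIfAbsent(o, k -> new Tracker(description));
    }
  }

  /** Stand-in for a TrackerPoint: one position in one tracker. */
  static final class Point {
    final Tracker tracker;
    final int index;

    Point(Tracker tracker, int index) {
      this.tracker = tracker;
      this.index = index;
    }
  }
}
```

Keying on identity means two arrays with equal content still get separate trackers. A real implementation would also have to avoid keeping every tracked object alive forever, which this sketch ignores.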

Basic instrumentation

To keep Trackers up to date, our instrumentation inserts calls to hook methods in FlowTracker at the points where specific JDK methods are called.

The simplest example of that is for System.arraycopy. We intercept that on the caller's side: Calls to java.lang.System.arraycopy are replaced with calls to com.coekie.flowtracker.hook.SystemHook.arraycopy. For this and other instrumentation, we use the ASM bytecode manipulation library. In SystemHook we call the real arraycopy, get the Trackers of the source and destination arrays from the TrackerRepository, and update the target Tracker to point to its source.

For example, given this code:

char[] abc = ...;
char[] abcbc = new char[5];
System.arraycopy(abc, 0, abcbc, 0, 3);
System.arraycopy(abc, 1, abcbc, 3, 2);

This gets rewritten to the following. Note that instrumentation happens on bytecode, not source code, but we show equivalent source code here because that's much easier to read.

char[] abc = ...;
char[] abcbc = new char[5];
SystemHook.arraycopy(abc, 0, abcbc, 0, 3);
SystemHook.arraycopy(abc, 1, abcbc, 3, 2);

After executing this, the tracker for abcbc would look like: {[0-2]: {tracker: abcTracker, sourceIndex: 0, length: 3}, [3-4]: {tracker: abcTracker, sourceIndex: 1, length: 2}}
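As a rough illustration of what such a caller-side hook does, the sketch below performs the real copy and then records which source range each target range came from. It is a simplification under stated assumptions: `ArrayCopySketch`, `Range` and the `RANGES` map are hypothetical names, and the sketch only appends ranges, without handling a later copy that overwrites part of an earlier one.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a caller-side hook; not the actual SystemHook.
class ArrayCopySketch {
  /** Target range [index, index + length) came from `source`, starting at sourceIndex. */
  static final class Range {
    final int index;
    final int length;
    final Object source;
    final int sourceIndex;

    Range(int index, int length, Object source, int sourceIndex) {
      this.index = index;
      this.length = length;
      this.source = source;
      this.sourceIndex = sourceIndex;
    }
  }

  // Stand-in for the repository: target array (by identity) -> recorded ranges.
  static final Map<Object, List<Range>> RANGES = new IdentityHashMap<>();

  /** Same parameters as System.arraycopy, so a call site can be redirected to it. */
  static synchronized void arraycopy(Object src, int srcPos, Object dest, int destPos, int length) {
    // Do the real copy first; it throws on invalid arguments, so nothing is recorded then.
    System.arraycopy(src, srcPos, dest, destPos, length);
    RANGES.computeIfAbsent(dest, k -> new ArrayList<>())
        .add(new Range(destPos, length, src, srcPos));
  }
}
```

Because the hook takes exactly the same arguments as the method it replaces, the rewrite at the call site only has to swap the owner class of the invoked method.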

That was an example of a hook on the caller side. But most calls to hook methods are added on the callee side, inside the methods in the JDK. For example take FileInputStream.read(byte[]), which reads data from a File and stores the result in the provided byte[]. We add the call to our hook method (FileInputStreamHook.afterReadByteArray) at the end of the FileInputStream.read(byte[]) method. We have our own instrumentation micro-framework for that, driven by annotations, implemented using ASM's AdviceAdapter.

That way we add hooks to a number of classes in the JDK responsible for input and output, such as java.io.FileInputStream, java.io.FileOutputStream, and internal classes like sun.nio.ch.FileChannelImpl, sun.nio.ch.IOUtil, sun.nio.ch.NioSocketImpl and more.

Implementation: SystemHook, FileInputStreamHook, and other classes in the hook package.
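The effect of a callee-side hook can be pictured in source form as follows. This is only a sketch: `SketchInput` is a hypothetical stand-in for a JDK input class so that the example runs without real files, and the parameters of `afterReadByteArray` are assumptions, since this document only names the hook. In FlowTracker the call is injected into bytecode rather than written by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical source-level picture of a callee-side hook.
class CalleeHookSketch {
  /** What the hook recorded: `length` bytes at the start of `buf` came from `offset` in the input. */
  static final class ReadEvent {
    final byte[] buf;
    final int offset;
    final int length;

    ReadEvent(byte[] buf, int offset, int length) {
      this.buf = buf;
      this.offset = offset;
      this.length = length;
    }
  }

  static final List<ReadEvent> EVENTS = new ArrayList<>();

  /** Hypothetical hook; these parameters are assumed for the sketch. */
  static void afterReadByteArray(int read, byte[] buf, int offset) {
    if (read > 0) {
      EVENTS.add(new ReadEvent(buf, offset, read));
    }
  }

  /** Stand-in for a class like FileInputStream, backed by an in-memory array. */
  static final class SketchInput {
    private final byte[] data;
    private int pos;

    SketchInput(byte[] data) {
      this.data = data;
    }

    int read(byte[] buf) {
      int start = pos;
      int remaining = data.length - pos;
      int read;
      if (buf.length == 0) {
        read = 0;
      } else if (remaining == 0) {
        read = -1; // end of input
      } else {
        read = Math.min(buf.length, remaining);
        System.arraycopy(data, pos, buf, 0, read);
        pos += read;
      }
      // The instrumentation would add a call like this just before the method returns.
      afterReadByteArray(read, buf, start);
      return read;
    }
  }
}
```

The hook sees both the return value and the arguments, which is what it needs to link the filled part of the buffer to a position in the input.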

Primitive values, dataflow analysis

A bigger challenge is tracking primitive values. Consider this example:

byte[] x; byte[] y;
// ...
byte b = x[1];
// ...
y[2] = b;

When that code is executed, we would need to update the Tracker of y, to remember that the value at index 2 comes from the value at index 1 in x. If those had been String[]s and b was a String instead of a byte, then we wouldn't need to modify code like this, because the TrackerRepository would know what the Tracker of the String is, and keeps that association no matter how that String object is passed around. But the TrackerRepository can't keep a mapping of primitive values like bytes to Trackers, because primitive values don't have an identity: any Map having a byte as key would mix up different occurrences of the same byte. Instead, we store the association of b to its tracker in a local variable in the method itself. The code gets rewritten to roughly something like this:

byte[] x; byte[] y;
// ...
byte b = x[1];
TrackerPoint bTracker = ArrayHook.getElementTracker(x, 1);
// ...
y[2] = b;
ArrayHook.setElementTracker(y, 2, bTracker);

To do that FlowTracker needs to understand how exactly values flow through a method. We build upon ASM's analysis support to analyze the code (symbolic interpretation). That way we construct a model of where values in local variables and on the stack come from at every point in the method, and where they end up.

This is implemented in

  • FlowValue and its subclasses (e.g. ArrayLoadValue) that model where values come from, and can generate the instructions that create the TrackerPoints that point to that source. A particularly interesting one is MergedValue, which handles situations where because of control flow (e.g. if-statements, loops) a value can come from multiple possible places.
  • FlowInterpreter: extension of ASM's Interpreter, interprets bytecode instructions, creates the appropriate FlowValues.
  • Store and its subclasses (e.g. ArrayStore) that represent places that FlowValues go to, that consume the TrackerPoints.
  • FlowTransformer: drives the whole analysis and instrumentation process. See its docs for a more detailed walkthrough of how this all fits together.

We don't track the source of all primitive values. The focus is on byte and char values, and to a lesser extent ints and longs.
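The rewritten snippet earlier in this section can be made runnable with a small stand-in for the two hook methods. This is a sketch under assumptions: `PrimitiveFlowSketch` and `Point` are hypothetical, and treating an element without a recorded origin as originating in its own array is a simplification chosen for the example.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch; getElementTracker/setElementTracker only mimic the calls shown above.
class PrimitiveFlowSketch {
  /** Stand-in for a TrackerPoint: one position in one source array. */
  static final class Point {
    final Object source;
    final int index;

    Point(Object source, int index) {
      this.source = source;
      this.index = index;
    }
  }

  // Array (by identity) -> origin of each element; null entries have no recorded origin.
  private static final Map<byte[], Point[]> ORIGINS = new IdentityHashMap<>();

  static synchronized Point getElementTracker(byte[] array, int index) {
    Point[] points = ORIGINS.get(array);
    Point p = points == null ? null : points[index];
    // Simplification: an element without a recorded origin originates in this array itself.
    return p != null ? p : new Point(array, index);
  }

  static synchronized void setElementTracker(byte[] array, int index, Point point) {
    ORIGINS.computeIfAbsent(array, a -> new Point[a.length])[index] = point;
  }

  /** The instrumented form of: byte b = x[1]; y[2] = b; */
  static void copyOne(byte[] x, byte[] y) {
    byte b = x[1];
    Point bTracker = getElementTracker(x, 1); // travels next to b in a local variable
    y[2] = b;
    setElementTracker(y, 2, bTracker);
  }
}
```

The point travels in a local variable alongside the byte, so two bytes with the same numeric value never get confused, which is exactly what a map keyed on the byte value could not guarantee.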

Method invocations

The dataflow analysis from the previous section is limited to handling the flow of primitive values within a single method. Those values also flow into other methods, as arguments and return values of method invocations. We model that in Invocation, which stores TrackerPoints for arguments and return values. The Invocation is stored in a ThreadLocal just before a method invocation, and retrieved at the start of the implementation of the method.

For example, take this code passing a primitive value to a "write" method:

void caller() {
  byte b = ...;
  out.write(b);  
}

...

class MyOutputStream {
  void write(byte value) {
    ... // do something with value
  }
}

To get the TrackerPoint of b into the write method, the code is instrumented like this:

void caller() {
  byte b = ...;
  TrackerPoint bTracker = ...;
  Invocation.create("write(byte)")
    .setArg(0, bTracker)
    // this puts the Invocation in the ThreadLocal
    .calling(); 
  out.write(b);  
}

...

class MyOutputStream {
  void write(byte value) {
    // this extracts the Invocation from the ThreadLocal
    Invocation invocation = Invocation.start("write(byte)");
    TrackerPoint valueTracker = invocation.getArg0();
    ... // do something with value & valueTracker
  }
}

Implementation: Invocation, InvocationArgStore, InvocationArgValue, InvocationReturnStore, InvocationReturnValue, InvocationOutgoingTransformation, InvocationIncomingTransformation
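The ThreadLocal hand-off can be sketched in a few lines. This is not the real Invocation class: the fixed-size argument array, the static `getArg` helper, and the signature check in `start` (so that a callee ignores an Invocation meant for a different method, or finds none when its caller was not instrumented) are assumptions made for the sketch.

```java
// Hypothetical sketch of passing argument origins through a ThreadLocal.
class InvocationSketch {
  /** Stand-in for a TrackerPoint. */
  static final class Point {
    final String description;

    Point(String description) {
      this.description = description;
    }
  }

  static final class Invocation {
    private static final ThreadLocal<Invocation> PENDING = new ThreadLocal<>();

    private final String signature;
    private final Point[] args = new Point[4]; // sketch: at most four tracked arguments

    private Invocation(String signature) {
      this.signature = signature;
    }

    static Invocation create(String signature) {
      return new Invocation(signature);
    }

    Invocation setArg(int index, Point point) {
      args[index] = point;
      return this;
    }

    /** Called by the caller just before the real method call. */
    Invocation calling() {
      PENDING.set(this);
      return this;
    }

    /** Called at the start of the callee; consumes the pending Invocation. */
    static Invocation start(String signature) {
      Invocation pending = PENDING.get();
      PENDING.remove();
      return pending != null && pending.signature.equals(signature) ? pending : null;
    }

    /** Null-safe accessor, so an uninstrumented caller simply yields "no origin". */
    static Point getArg(Invocation invocation, int index) {
      return invocation == null ? null : invocation.args[index];
    }
  }

  static Point lastSeen;

  static void write(byte value) {
    Invocation invocation = Invocation.start("write(byte)");
    lastSeen = Invocation.getArg(invocation, 0);
  }

  static void caller(byte b, Point bTracker) {
    Invocation.create("write(byte)").setArg(0, bTracker).calling();
    write(b);
  }
}
```

Clearing the ThreadLocal in `start` keeps a stale Invocation from leaking into a later, unrelated call on the same thread.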

Code as origin

There are two main types of tracked origins of data. There is I/O, which is tracked as explained in the "Basic instrumentation" section. And there are values coming from the code itself, such as primitive and String constants ('a', "abc"). For those, we create a tracker for each class (a ClassOriginTracker), that contains a textual representation of that class and the constants that it references. When those constants are referenced, we then point the trackers for those values at the corresponding place in that textual representation. That is as if our textual representation of the class is where the values were read from. That makes our model for constants look very similar to how we model I/O.

For example for this code:

class MyClass {
  void myMethod() {
    char a = 'x';
    ... // do something with a
  }
}

We generate a ClassOriginTracker with content that looks like this:

class MyClass
void myMethod():
  (line 3): x

And the code gets rewritten to something like:

class MyClass {
  void myMethod() {
    char a = 'x';
    TrackerPoint aTracker = ConstantHook.constantPoint(
      1234 /* id for MyClass*/,
      81 /* offset of 'x' in the ClassOriginTracker content */);
    
    ... // do something with a and aTracker
  }
}

For performance reasons, we actually use ConstantDynamic (JEP 309) to ensure that the constantPoint methods are only invoked once instead of every time myMethod executes.

Implementation: ClassOriginTracker, ConstantValue, ConstantsTransformation
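How the textual representation and the offsets into it relate can be sketched like this. `ClassOriginSketch` and its methods are hypothetical names, and the exact layout of the generated text is only modelled on the example content shown above.

```java
// Hypothetical sketch of building a per-class origin text and handing out offsets into it.
class ClassOriginSketch {
  private final StringBuilder content = new StringBuilder();

  ClassOriginSketch(String className) {
    content.append("class ").append(className).append('\n');
  }

  void startMethod(String signature) {
    content.append(signature).append(":\n");
  }

  /** Appends a line for a constant and returns the offset at which the constant's text starts. */
  int registerConstant(int line, String text) {
    content.append("  (line ").append(line).append("): ");
    int offset = content.length();
    content.append(text).append('\n');
    return offset;
  }

  String content() {
    return content.toString();
  }
}
```

The offset returned at instrumentation time is what ends up hard-coded in the rewritten method, so looking up the origin of a constant at run time needs no searching.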

String literals

For String literals, we create a new copy of the String, and associate the content of the String (the byte[] in String.value) with the ClassOriginTracker. A statement like String s = "abc"; gets rewritten to String s = StringHook.constantString("abc", 1234, 81);. This breaks a guarantee that the JVM normally provides, namely that all String constants are interned: all occurrences of the same String constant should refer to the same instance. Most code doesn't actually rely on String interning, but code that does would be broken by our instrumentation. We avoid most of the issues that this could cause because:

  • We use ConstantDynamic, so the same String literal (at the same line of code) executed multiple times still gives the same instance every time.
  • We rewrite some stringA == stringB expressions as Objects.equals(stringA, stringB), so that from some points of view they look like the same instance again.
  • We disable tracking of String literals in some packages (such as java.lang.*). This is configurable (see breakStringInterning in USAGE.md).

Implementation: StringLdc, ConstantsTransformation, StringComparison
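The interning problem, and why rewriting comparisons helps, can be seen with plain Java. In this sketch `copyOf` is a hypothetical stand-in for the copying that a hook like constantString has to do; it is not FlowTracker's code.

```java
import java.util.Objects;

// Hypothetical sketch of the interning issue; not FlowTracker's StringHook.
class InterningSketch {
  /** Returns a distinct String instance with the same content as the literal. */
  static String copyOf(String literal) {
    return new String(literal); // a new, non-interned instance
  }

  /** What uninstrumented code relying on interning would do. */
  static boolean identityCompare(String a, String b) {
    return a == b;
  }

  /** What the comparison looks like after being rewritten. */
  static boolean rewrittenCompare(String a, String b) {
    return Objects.equals(a, b);
  }
}
```

With `String s = InterningSketch.copyOf("abc")`, the identity comparison of `s` against the literal "abc" is false, while the rewritten comparison is true.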

Fallback for untracked values

FlowTracker does not track every value in the program. That is partly because of performance concerns, partly because we just haven't implemented everything we would want, and partly because some values don't seem relevant or would require a more complicated data model in which values can come from a combination of places (e.g. calculated numerical values). When values that are not being tracked end up in places where we do want to start tracking them, we treat them similarly to constants: we add a link to the ClassOriginTracker, pointing at the place where they became tracked, represented there as "<?>". For example, lengths of arrays are values that are not tracked. So if a method calls write(array.length), then in that Invocation we pass a TrackerPoint that refers to the place in the code where the write method is called.

In practice, this means that when you look at some output, particularly output in a binary format, you may not see where a value originally came from, but you can often still quickly decipher what it means (e.g. "that value just before that tracked String points to write(array.length), so that must be the length of that String").

More

More topics about the implementation that could be talked about, but didn't make the cut. Most of this is documented in the code, if you really want to learn more:

  • Details of MergedValue: the hardest part of dataflow analysis, how to instrument code to keep track of values through branches and loops.
  • How we hook String concatenation through its indification (JEP 280) by adding hooks to the MethodHandles returned by StringConcatFactory in StringConcatenation and StringConcatFactoryHook.
  • Finding the source code, decompiling with Vineflower, associating bytecode with source code lines. See SourceCodeGenerator, VineflowerCodeGenerator, AsmCodeGenerator.
  • The ClassLoader setup. How we avoid dependencies on the bootclasspath colliding with the app, without shading (because that makes debugging annoying) and without nested jars. Development setup that allows changing an agent without repackaging it, to ensure fast development cycles. See FlowTrackerAgent, DevAgent, SpiderClassLoader.
  • How class loading can interfere with tracking of method invocations, and how we work around that. See SuspendInvocationTransformer, Invocation#suspend. Interesting problem, simple solution kinda obvious in retrospect.
  • Tracking of primitive values stored in fields: FieldRepository, FieldStore, FieldValue. Just more of the same, nothing surprising.
  • How we add comments into instrumented code to help understand and debug instrumentation. ASM/Bytecode doesn't support comments, but that won't stop me!
  • Avoiding circularity problems when instrumenting core JDK classes. I eat ClassCircularityErrors and StackOverflowErrors for breakfast.
  • Front-end: Web server with Jetty, JAX-RS. Web UI built with Svelte. Beautiful UI design by... nobody.
  • Our optimized ThreadLocal abomination in ContextSupplier. On second thought, never mind, you don't want to know.