
Spurious "Failed to identify trace type" error messages #247

Open
jlewi opened this issue Sep 18, 2024 · 0 comments

jlewi commented Sep 18, 2024

Our logs are showing error messages like the following.

{
  "insertId": "p01ulcfteojll",
  "jsonPayload": {
    "caller": "analyze/analyzer.go:328",
    "function": "github.com/jlewi/foyle/app/pkg/analyze.(*Analyzer).combineAndCheckpoint",
    "targetTraceId": "8d588abbd5b62374306f3f664f339922",
    "severity": "error",
    "errorVerbose": "Failed to identify trace type\ngithub.com/jlewi/foyle/app/pkg/analyze.combineEntriesForTrace\n\t/Users/jlewi/git_foyle/app/pkg/analyze/analyzer.go:741\ngithub.com/jlewi/foyle/app/pkg/analyze.(*Analyzer).buildTrace\n\t/Users/jlewi/git_foyle/app/pkg/analyze/analyzer.go:523\ngithub.com/jlewi/foyle/app/pkg/analyze.(*Analyzer).combineAndCheckpoint\n\t/Users/jlewi/git_foyle/app/pkg/analyze/analyzer.go:327\ngithub.com/jlewi/foyle/app/pkg/analyze.(*Analyzer).processLogFile\n\t/Users/jlewi/git_foyle/app/pkg/analyze/analyzer.go:307\ngithub.com/jlewi/foyle/app/pkg/analyze.(*Analyzer).handleLogFileEvents.func1\n\t/Users/jlewi/git_foyle/app/pkg/analyze/analyzer.go:205\ngithub.com/jlewi/foyle/app/pkg/analyze.(*Analyzer).handleLogFileEvents\n\t/Users/jlewi/git_foyle/app/pkg/analyze/analyzer.go:218\nruntime.goexit\n\t/opt/homebrew/opt/go/libexec/src/runtime/asm_arm64.s:1222",
    "time": 1726687817.459125,
    "message": "Error building trace",
    "error": "Failed to identify trace type"
  },
  "resource": {
    "type": "project",
    "labels": {
      "project_id": "foyle-dev"
    }
  },
  "timestamp": "2024-09-18T19:30:17.459125041Z",
  "severity": "ERROR",
  "logName": "projects/foyle-dev/logs/foyle",
  "receiveTimestamp": "2024-09-18T19:30:18.112128745Z"
}

If we search the logs for that trace, we see that it has an Agent.Generate message.
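A sketch of how that search can be done against the raw log files (assuming they are newline-delimited JSON under ~/.foyle/logs/raw, as in the jq cells in the notebook below; the field names traceId and targetTraceId come from the payloads in this issue):

```shell
# Filter newline-delimited JSON logs for entries belonging to one trace.
TRACE_ID="8d588abbd5b62374306f3f664f339922"
for f in ~/.foyle/logs/raw/*.json; do
  [ -e "$f" ] || continue  # skip if the glob matched nothing
  jq --arg id "$TRACE_ID" \
    'select(.traceId == $id or .targetTraceId == $id)' "$f"
done
```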

{
  "insertId": "3weoyfeyrt2a",
  "jsonPayload": {
    "contextId": "01J7Z2GJWGJKHTKD4N2X5RJ25X",
    "severity": "info",
    "message": "Agent.Generate",
    "time": 1726545957.47473,
    "caller": "agent/agent.go:105",
    "evalMode": false,
    "streamTraceId": "f8f2b5f7c233c88c1114c0a0ad623223",
    "request": {
      "doc": {
        "blocks": [
          {
            "outputs": [],
            "trace_ids": [],
            "language": "markdown",
            "contents": "# Controlling Anthropic Costs",
            "id": "",
            "kind": "MARKUP"
          },
          {
            "language": "markdown",
            "kind": "MARKUP",
            "contents": "* Why are we spending so much on Anthropic?\n* How many LLM calls are we making?\n* We are logging data to Honeycomb but\n\n   * I'm not seeing any traces for Generate; only StreamGenerate\n   * I'm not seeing any traces for Anthropic requests\n\n* I tried setting the DefaultTransport for the client but I'm still not seeing events.\n* Interestingly we are getting events for\n   google.logging.v2.LoggingServiceV2/WriteLogEntries\n* I tried skipping the retryable client and it still didn't work.\n* I think Google Cloud Logging is using gRPC\n* The OTEL roundtripper is being invoked\n* I checked the Anthropic client and the body is being closed?\n* When I directly called generate it worked\n\n   * I think the problem is when I call GenerateCells from StreamGenerate there is no context\n\n* I take that back;\n\n   * Here's an event; https://ui.honeycomb.io/autobuilder/environments/prod/datasets/foyle/result/aFeSuUieHzs/trace/qwdytg1Mdgp?fields[]=s_name&fields[]=s_serviceName&span=12449317177fbeee\n   * It looks like I was looking at the wrong root event\n   * Here's where we generate a new trace; https://github.com/jlewi/foyle/blob/5ab52e5ed4e97437fd7e42dfe2d88ef07dfbd864/app/pkg/agent/agent.go#L471",
            "trace_ids": [],
            "outputs": [],
            "id": ""
          },
          {
            "contents": "## Generate a Honeycomb Query",
            "id": "",
            "trace_ids": [],
            "language": "markdown",
            "outputs": [],
            "kind": "MARKUP"
          },
          {
            "outputs": [],
            "trace_ids": [],
            "contents": "* Generate a honeycomb query to find the traces named CreateCompletion\n* Use the hccli tool to run the query and print the link",
            "kind": "MARKUP",
            "id": "",
            "language": "markdown"
          },
          {
            "trace_ids": [],
            "id": "01J71GB2FJS4GMAQAM5HY7DTG8",
            "contents": "# Generate the Honeycomb query\nQUERY='{\"calculations\":[{\"op\":\"COUNT\"}],\"filters\":[{\"column\":\"name\",\"op\":\"=\",\"value\":\"CreateCompletion\"}],\"time_range\":86400}'\n\n# Run the query using hccli\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod",
            "language": "bash",
            "kind": "CODE",
            "outputs": []
          },
          {
            "contents": "Generate a query to count llm.input_tokens",
            "trace_ids": [],
            "language": "markdown",
            "kind": "MARKUP",
            "outputs": [],
            "id": ""
          },
          {
            "id": "01J71HWHYNVHXN9ATDJDCT480W",
            "contents": "QUERY='{\"calculations\":[{\"op\":\"COUNT\"},{\"op\":\"SUM\",\"column\":\"llm.input_tokens\"},{\"op\":\"SUM\",\"column\":\"llm.output_tokens\"}],\"filters\":[{\"column\":\"name\",\"op\":\"=\",\"value\":\"CreateCompletion\"}],\"time_range\":86400}'\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod",
            "kind": "CODE",
            "outputs": [],
            "trace_ids": [],
            "language": "bash"
          },
          {
            "trace_ids": [],
            "contents": "* So two questions we want to answer are\n\n   * How many LLM calls are we making per cell generation?\n   * How many Input/Output tokens are we consuming per cell?\n\n* Traces have a complete ID field so we can use that to group by cell",
            "kind": "MARKUP",
            "language": "markdown",
            "id": "",
            "outputs": []
          },
          {
            "contents": "QUERY='{\"calculations\":[{\"op\":\"COUNT\", \"alias\": \"Total_LLM_Calls\"},{\"op\":\"SUM\",\"column\":\"llm.input_tokens\", \"alias\": \"Total_Input_Tokens\"},{\"op\":\"SUM\",\"column\":\"llm.output_tokens\", \"alias\": \"Total_Output_Tokens\"}],\"filters\":[{\"column\":\"name\",\"op\":\"=\",\"value\":\"CreateCompletion\"}],\"time_range\":86400}'\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod",
            "language": "bash",
            "outputs": [],
            "kind": "CODE",
            "trace_ids": [],
            "id": "01J71JP3RGMMVRP982YA47HNT7"
          },
          {
            "outputs": [],
            "language": "markdown",
            "id": "",
            "contents": "* The sum of the llm tokens is showing zero\n* I think this is because the llm tokens and output tokens are set on the Complete span\n* So we need to query for the Complete Span",
            "kind": "MARKUP",
            "trace_ids": []
          },
          {
            "id": "01J71JT2H2R4TZ0ZAVSVRH87D8",
            "trace_ids": [],
            "kind": "CODE",
            "language": "bash",
            "outputs": [
              {
                "items": [
                  {
                    "mime": "stateful.runme/terminal",
                    "text_data": "{\n\t\"type\": \"stateful.runme/terminal\",\n\t\"output\": {\n\t\t\"runme.dev/id\": \"01J71JT2H2R4TZ0ZAVSVRH87D8\",\n\t\t\"content\": \"\",\n\t\t\"initialRows\": 10,\n\t\t\"enableShareButton\": true,\n\t\t\"isAutoSaveEnabled\": false,\n\t\t\"isPlatformAuthEnabled\": false,\n\t\t\"isSessionOutputsEnabled\": true,\n\t\t\"backgroundTask\": true,\n\t\t\"nonInteractive\": false,\n\t\t\"interactive\": true,\n\t\t\"fontSize\": 16,\n\t\t\"fontFamily\": \"Menlo, Monaco, 'Courier New', monospace\",\n\t\t\"rows\": 10,\n\t\t\"cursorStyle\": \"bar\",\n\t\t\"cursorBlink\": true,\n\t\t\"cursorWidth\": 1,\n\t\t\"smoothScrollDuration\": 0,\n\t\t\"scrollback\": 1000,\n\t\t\"closeOnSuccess\": true\n\t}\n}"
                  },
                  {
                    "text_data": "",
                    "mime": "application/vnd.code.notebook.stdout"
                  }
                ]
              }
            ],
            "contents": "QUERY='{\n  \"calculations\": [\n    {\"op\": \"COUNT\", \"alias\": \"LLM_Calls_Per_Cell\"},\n    {\"op\": \"SUM\", \"column\": \"llm.input_tokens\", \"alias\": \"Input_Tokens_Per_Cell\"},\n    {\"op\": \"SUM\", \"column\": \"llm.output_tokens\", \"alias\": \"Output_Tokens_Per_Cell\"}\n  ],\n  \"filters\": [\n    {\"column\": \"name\", \"op\": \"=\", \"value\": \"Complete\"}\n  ],\n  \"breakdowns\": [\"trace.trace_id\"],\n  \"time_range\": 86400,\n  \"order_by\": [{\"op\": \"COUNT\", \"order\": \"descending\"}]\n}'\n\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod --open"
          },
          {
            "outputs": [],
            "id": "",
            "language": "markdown",
            "kind": "MARKUP",
            "trace_ids": [],
            "contents": "* So Input tokens is quite large O(10K-15K) output is O(100-500)\n* I suspect the variability of the input tokens is due to the context coming in from RAG\n* Anthropic is \\$3 per 1M input tokens or about $.03 per 10K toens\n* Output is about .0015 per 100 tokens at $15 per 1M tokens\n* Group the Complete traces by the contextID field and count them"
          },
          {
            "contents": "# Query to get the top 100 traces by input token usage\nQUERY='{\n  \"calculations\": [\n    {\"op\": \"COUNT\", \"alias\": \"LLM_Calls_Per_Cell\"}\n  ],\n  \"filters\": [\n    {\"column\": \"name\", \"op\": \"=\", \"value\": \"Complete\"}\n  ],\n  \"breakdowns\": [\"contextID\"],\n  \"time_range\": 86400,  \n  \"limit\": 100\n}'\n\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod",
            "id": "01J71K6Q383MQEQ32A592Y8S3W",
            "kind": "CODE",
            "language": "bash",
            "trace_ids": [],
            "outputs": []
          },
          {
            "kind": "MARKUP",
            "language": "markdown",
            "contents": "* Looks like we are at 4-10 calls per contextID\n* 10 calls that's about $.30 per cell\n* copilot is $10 per month\n* That's approximately 333 cells per month\n* My token usage right now is between 1million and 4million tokens per day and 20Kto50K output\n* So right now I'm roughly 10x too expensive",
            "trace_ids": [],
            "outputs": [],
            "id": ""
          },
          {
            "outputs": [],
            "kind": "MARKUP",
            "contents": "## Acceptance events",
            "language": "markdown",
            "id": "",
            "trace_ids": []
          },
          {
            "trace_ids": [],
            "id": "",
            "kind": "MARKUP",
            "contents": "* Are we logging any LOG events\n* I'm not seeing any traces for LogEvents\n* Lets query the logs using JQ to check for any LogEvent messages",
            "outputs": [],
            "language": "markdown"
          },
          {
            "trace_ids": [],
            "kind": "CODE",
            "contents": "LASTLOG=$(ls -t ~/.foyle/logs/raw/*.json | head -n 1)\necho ${LASTLOG}\njq 'select(has(\"eventType\"))' \"$LASTLOG\"",
            "id": "01J7279EYWEDWGQJPM4DW3136D",
            "language": "bash",
            "outputs": []
          },
          {
            "contents": "* So indeed we have no LOGEVENTS so maybe we need to install and build a newer version of the frontend\n* Looks like I had 3.7.3 installed; probably auto upgraded to non dev build. So I built and installed the latest from main.\n* v3.7.4-edge.1-jlewi\n* So lets rerun the query to check logs for even types",
            "language": "markdown",
            "kind": "MARKUP",
            "id": "",
            "outputs": [],
            "trace_ids": []
          },
          {
            "kind": "CODE",
            "id": "01J7281Q3QHJMGH6NH7J6XT4J5",
            "contents": "LASTLOG=$(ls -t ~/.foyle/logs/raw/*.json | head -n 1)\necho \"Using log file: ${LASTLOG}\"\njq 'select(has(\"eventType\"))' \"${LASTLOG}\"",
            "trace_ids": [],
            "outputs": [],
            "language": "bash"
          },
          {
            "contents": "* Success!\n* Now generate a honeycomb query to look for traces LogEvents RPC",
            "id": "",
            "outputs": [],
            "language": "markdown",
            "kind": "MARKUP",
            "trace_ids": []
          },
          {
            "outputs": [],
            "language": "bash",
            "contents": "QUERY='{\n  \"calculations\": [\n    {\"op\": \"COUNT\", \"alias\": \"Event_Count\"}\n  ],\n  \"filters\": [\n     {\"column\": \"name\", \"op\": \"=\", \"value\": \"LogEvents\"}\n  ],\n  \"breakdowns\": [\"eventType\"],\n  \"time_range\": 86400,\n  \"order_by\": [{\"op\": \"COUNT\", \"order\": \"descending\"}]\n}'\n\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod",
            "trace_ids": [],
            "id": "01J7282S2VFGE8DSYHY5NV0044",
            "kind": "CODE"
          },
          {
            "contents": "echo 'https://ui.honeycomb.io/autobuilder/environments/prod/datasets/foyle?query={\"breakdowns\":[\"eventType\"],\"calculations\":[{\"op\":\"COUNT\"}],\"filters\":[{\"op\":\"=\",\"column\":\"name\",\"value\":\"LogEvents\"}],\"time_range\":86400}'",
            "outputs": [],
            "kind": "CODE",
            "language": "bash",
            "trace_ids": [],
            "id": "01J7287NENGHATDN4RGN1VSA3C"
          },
          {
            "id": "",
            "trace_ids": [],
            "kind": "MARKUP",
            "language": "markdown",
            "contents": "* Now generate a honeycomb query to look at traces named LogEvent which I just added\n* We want to veriy the spans got set",
            "outputs": []
          },
          {
            "contents": "QUERY='{\n  \"calculations\": [\n    {\"op\": \"COUNT\", \"alias\": \"Event_Count\"}\n  ],\n  \"filters\": [\n    {\"column\": \"name\", \"op\": \"=\", \"value\": \"LogEvent\"}\n  ],\n  \"breakdowns\": [\"eventType\"],\n  \"time_range\": 86400,\n  \"order_by\": [{\"op\": \"COUNT\", \"order\": \"descending\"}]\n}'\n\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod",
            "kind": "CODE",
            "outputs": [],
            "id": "01J7298DV5Q01TCMV9H8SBM2Q0",
            "trace_ids": [],
            "language": "bash"
          },
          {
            "trace_ids": [],
            "outputs": [],
            "language": "markdown",
            "contents": "## Measure Efficacy",
            "id": "",
            "kind": "MARKUP"
          },
          {
            "outputs": [],
            "trace_ids": [],
            "contents": "* So to measure efficacy we want to look at the number of accepted events",
            "language": "markdown",
            "id": "",
            "kind": "MARKUP"
          },
          {
            "language": "bash",
            "kind": "CODE",
            "outputs": [
              {
                "items": [
                  {
                    "mime": "stateful.runme/terminal",
                    "text_data": "{\n\t\"type\": \"stateful.runme/terminal\",\n\t\"output\": {\n\t\t\"runme.dev/id\": \"01J729C3MV620S8ND840TZ16G2\",\n\t\t\"content\": \"\",\n\t\t\"initialRows\": 10,\n\t\t\"enableShareButton\": true,\n\t\t\"isAutoSaveEnabled\": false,\n\t\t\"isPlatformAuthEnabled\": false,\n\t\t\"isSessionOutputsEnabled\": true,\n\t\t\"backgroundTask\": true,\n\t\t\"nonInteractive\": false,\n\t\t\"interactive\": true,\n\t\t\"fontSize\": 16,\n\t\t\"fontFamily\": \"Menlo, Monaco, 'Courier New', monospace\",\n\t\t\"rows\": 10,\n\t\t\"cursorStyle\": \"bar\",\n\t\t\"cursorBlink\": true,\n\t\t\"cursorWidth\": 1,\n\t\t\"smoothScrollDuration\": 0,\n\t\t\"scrollback\": 1000,\n\t\t\"closeOnSuccess\": true\n\t}\n}"
                  },
                  {
                    "mime": "application/vnd.code.notebook.stdout",
                    "text_data": ""
                  }
                ]
              }
            ],
            "contents": "QUERY='{\n  \"calculations\": [\n    {\"op\": \"COUNT\", \"alias\": \"Event_Count\"}\n  ],\n  \"filters\": [\n    {\"column\": \"name\", \"op\": \"=\", \"value\": \"LogEvent\"},\n    {\"column\": \"eventType\", \"op\": \"=\", \"value\": \"ACCEPTED\"}\n  ],\n  \"time_range\": 86400,\n  \"order_by\": [{\"op\": \"COUNT\", \"order\": \"descending\"}]\n}'\n\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod --open",
            "id": "01J729C3MV620S8ND840TZ16G2",
            "trace_ids": []
          },
          {
            "trace_ids": [],
            "kind": "MARKUP",
            "outputs": [],
            "id": "",
            "language": "markdown",
            "contents": "* Interesting it looks like when I paste the URL into honeycomp `ACCEPTED` ends up being lowercased and as a result there are no matches to the query.\n* Is that a bug in the honeycomb UI?"
          },
          {
            "contents": "## Controlling costs\n\n* Input tokens is still very large.\n* Is that due to multiple requests per cell?\n* How would we measure the number of requests generated for each cell?\n\n",
            "id": "",
            "kind": "MARKUP",
            "trace_ids": [],
            "language": "markdown",
            "outputs": []
          },
          {
            "trace_ids": [],
            "language": "markdown",
            "kind": "MARKUP",
            "outputs": [],
            "id": "",
            "contents": "## Don't generate completions for code cells\n\n* As one way to control costs, we could avoid generating completions for code cells."
          },
          {
            "trace_ids": [],
            "contents": "QUERY='{\n  \"calculations\": [\n    {\"op\": \"COUNT\", \"alias\": \"Request_Count\"},\n    {\"op\": \"AVG\", \"column\": \"llm.input_tokens\", \"alias\": \"Avg_Input_Tokens\"},\n    {\"op\": \"P90\", \"column\": \"llm.input_tokens\", \"alias\": \"P90_Input_Tokens\"},\n    {\"op\": \"MAX\", \"column\": \"llm.input_tokens\", \"alias\": \"Max_Input_Tokens\"},\n    {\"op\": \"SUM\", \"column\": \"llm.input_tokens\", \"alias\": \"Total_Input_Tokens\"}\n  ],\n  \"filters\": [\n    {\"column\": \"name\", \"op\": \"=\", \"value\": \"Complete\"}\n  ],\n  \"breakdowns\": [\"contextID\"],\n  \"time_range\": 86400,\n  \"order_by\": [{\"column\": \"Total_Input_Tokens\", \"order\": \"descending\"}],\n  \"limit\": 100\n}'\n\nhccli querytourl --dataset foyle --query \"$QUERY\" --base-url https://ui.honeycomb.io/autobuilder/environments/prod --open",
            "kind": "CODE",
            "outputs": [],
            "id": "01J7Z2G58QE24ZZ2ZGKBFWH5TA",
            "language": "bash"
          }
        ]
      }
    },
    "function": "github.com/jlewi/foyle/app/pkg/agent.(*Agent).Generate",
    "traceId": "8d588abbd5b62374306f3f664f339922"
  },
  "resource": {
    "type": "project",
    "labels": {
      "project_id": "foyle-dev"
    }
  },
  "timestamp": "2024-09-17T04:05:57.474730014Z",
  "severity": "INFO",
  "logName": "projects/foyle-dev/logs/foyle",
  "receiveTimestamp": "2024-09-17T04:05:57.955291283Z"
}

So combineEntriesForTrace should correctly identify it as a generate trace. We periodically call combineAndCheckpoint, which does the combine:

a.combineAndCheckpoint(ctx, path, offset, traceIDs)

So I think the problem is that we are calling combine for the trace before we have received all of its log messages.
This should be fine, because when the remaining log messages for that trace do arrive they will trigger another combine.
So we should probably update the code to not report an error in this case.
