tracing

Tracing approach

Our basic approach is built on a service mesh (Envoy; embarrassingly, we did not end up using it for this) and Jaeger.

Environment setup

The Jaeger deployment mainly follows the official jaeger-k8s documentation. Pay attention to how the five components differ.

[Figure: 1.jpg]

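Port   Protocol  Component  Function
5775   UDP       agent      accepts zipkin.thrift over the Thrift compact protocol
6831   UDP       agent      accepts jaeger.thrift over the Thrift compact protocol
6832   UDP       agent      accepts jaeger.thrift over the Thrift binary protocol
5778   HTTP      agent      serves the configuration endpoint
16686  HTTP      query      serves the UI
14268  HTTP      collector  accepts jaeger.thrift directly from clients
14250  gRPC      collector  accepts model.proto
9411   HTTP      collector  Zipkin-compatible HTTP endpoint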

 

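The agent and collector run in the shared-services k8s environment (a01), while the Jaeger web UI can live in the management environment (a02).
For the storage backend we did not use Cassandra; we use Elasticsearch instead (an ES cluster on physical machines, shared with the critical-event reporting pipeline).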

Test environment

A complete all-in-one installation in a single command; suitable for testing only.

docker run -d -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p5775:5775/udp \
  -p6831:6831/udp \
  -p6832:6832/udp \
  -p5778:5778 \
  -p16686:16686 \
  -p14268:14268 \
  -p9411:9411 \
  jaegertracing/all-in-one:latest

Production environment

https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/master/production-elasticsearch/configmap.yml   

https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/master/production-elasticsearch/elasticsearch.yml  

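The official repository provides these two YAML files; we used them in the test environment with minor changes.
The main changes: the agent DaemonSet became a Deployment, and the ES service was replaced with a real IP address and port.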

Usage

Reporting from the proxy

We turned on tracing in each service's local Envoy proxy.
(We eventually dropped this approach.)

tracing:  
  http:
    name: envoy.zipkin
    config:
      collector_cluster: jaeger
      collector_endpoint: "/api/v1/spans"
......
static_resources:  
  clusters:
  - name: jaeger
    connect_timeout: 0.25s
    type: strict_dns
    lb_policy: round_robin
    hosts:
    - socket_address:
        address: jaeger-collector
        port_value: 9411

The address jaeger-collector of the jaeger cluster must be reachable from inside k8s (the port shows that this is the HTTP endpoint).

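Calls between services

An egress configuration has to be added, because calls between services use the egress listener's information.
The official example configuration: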

listeners:  
.....
          tracing:
            operation_name: egress
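
Ingress

The front Envoy proxy at the ingress also needs to generate a unique request ID: generate_request_id: true
Likewise, we push this configuration down to every service through the service mesh.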

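Storage calls

This part, for example MySQL and Redis, probably requires us to write the reporting code by hand (Envoy can proxy MySQL and Redis, but it does not support tracing for them).

Direct reporting
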
// main.go
package main

import (  
    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    jaegercfg "github.com/uber/jaeger-client-go/config"
    "log"
    "time"
)   

func main() {  
    cfg, err := jaegercfg.FromEnv()
    if err != nil {
        // parsing errors might happen here, such as when we get a string where we expect a number
        log.Printf("Could not parse Jaeger env vars: %s", err.Error())
        return
    }

    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Printf("Could not initialize jaeger tracer: %s", err.Error())
        return 
    } 
    defer closer.Close()

    opentracing.SetGlobalTracer(tracer)
    // continue main()

    span := opentracing.StartSpan("test_chainhelen")
    ext.SamplingPriority.Set(span, 1)
    defer func() {
        span.Finish()
    }() 

    time.Sleep(time.Duration(2) * time.Second)
    log.Printf("main...\n")
}

export JAEGER_SERVICE_NAME=chainhelen_service  
export JAEGER_REPORTER_LOG_SPANS=true  
export JAEGER_ENDPOINT=http://127.0.0.1:14268/api/traces  
go build -o main main.go  
./main 

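Note that Envoy samples 100% of requests by default.
In code, however, the default sampling rate came out as 0 for us unless SamplingPriority was set, i.e. nothing was sampled, which is why no data showed up at first.

Calls between services: parsing the headers

There is a protocol-compatibility problem here. We standardized on the Zipkin-compatible B3 headers, so code that simply follows the jaeger-client-go example fails with an error: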

     var serverSpan opentracing.Span
     wireContext, err := opentracing.GlobalTracer().Extract(
         opentracing.HTTPHeaders,
         opentracing.HTTPHeadersCarrier(r.Header))
     if err != nil {
         fmt.Printf("Error %s", err.Error())
         return
     }

Error opentracing: SpanContext not found in Extract carrierx-request-id :5aeda785-786d-4ca1-8362-0eeb72b8f70a  
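
To make the failure concrete: a Jaeger tracer with default propagation looks for its own uber-trace-id header, while Envoy's Zipkin tracer sends x-b3-* headers, so the extractor finds nothing it recognizes. Below is a minimal standard-library sketch of what a B3-aware extractor has to read. The names b3Context, extractB3 and newHeader are hypothetical and exist only for this illustration; this is not the jaeger-client-go implementation.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// b3Context holds the fields carried by the Zipkin B3 multi-header format.
type b3Context struct {
	TraceID string
	SpanID  string
	Sampled string // "1", "0", or "" when the header is absent
}

var errNoSpanContext = errors.New("span context not found in carrier")

// extractB3 reads the B3 headers from h. Trace ID and span ID are mandatory;
// without them there is no upstream span to continue.
func extractB3(h http.Header) (b3Context, error) {
	c := b3Context{
		TraceID: h.Get("X-B3-TraceId"),
		SpanID:  h.Get("X-B3-SpanId"),
		Sampled: h.Get("X-B3-Sampled"),
	}
	if c.TraceID == "" || c.SpanID == "" {
		return b3Context{}, errNoSpanContext
	}
	return c, nil
}

// newHeader builds an http.Header from key/value pairs (demo helper).
func newHeader(kv ...string) http.Header {
	h := http.Header{}
	for i := 0; i+1 < len(kv); i += 2 {
		h.Set(kv[i], kv[i+1])
	}
	return h
}

func main() {
	// x-request-id alone is not a span context.
	_, err := extractB3(newHeader("x-request-id", "5aeda785-786d-4ca1-8362-0eeb72b8f70a"))
	fmt.Println(err)

	c, err := extractB3(newHeader(
		"x-b3-traceid", "463ac35c9f6413ad",
		"x-b3-spanid", "a2fb4a1d1a96d312",
		"x-b3-sampled", "1"))
	fmt.Println(c.TraceID, c.SpanID, c.Sampled, err)
}
```

Header lookups in net/http are case-insensitive, so it does not matter whether the proxy writes x-b3-traceid or X-B3-TraceId.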

The official documentation also covers this under http-b3-compatible-header; its example uses NewZipkinB3HTTPHeaderPropagator:

    // Recommended configuration for production.
    cfg := jaegercfg.Configuration{}

    // Example logger and metrics factory. Use github.com/uber/jaeger-client-go/log
    // and github.com/uber/jaeger-lib/metrics respectively to bind to real logging and metrics
    // frameworks.
    jLogger := jaegerlog.StdLogger
    jMetricsFactory := metrics.NullFactory

    // Zipkin shares span ID between client and server spans; it must be enabled via the following option.
    zipkinPropagator := zipkin.NewZipkinB3HTTPHeaderPropagator()

    // Create tracer and then initialize global tracer
    closer, err := cfg.InitGlobalTracer(
      serviceName,
      jaegercfg.Logger(jLogger),
      jaegercfg.Metrics(jMetricsFactory),
      jaegercfg.Injector(opentracing.HTTPHeaders, zipkinPropagator),
      jaegercfg.Extractor(opentracing.HTTPHeaders, zipkinPropagator),
      jaegercfg.ZipkinSharedRPCSpan(true),
    )

    if err != nil {
        log.Printf("Could not initialize jaeger tracer: %s", err.Error())
        return
    }
    defer closer.Close()

The final code for forwarding a call from one service to the next:

     var serverSpan opentracing.Span
     wireContext, err := opentracing.GlobalTracer().Extract(
         opentracing.HTTPHeaders,
         opentracing.HTTPHeadersCarrier(r.Header))
     if err != nil {
         fmt.Printf("Error %s", err.Error())
         return
     }

     // Create the span referring to the RPC client if available.
     // If wireContext == nil, a root span will be created.
     serverSpan = opentracing.StartSpan(
         "tracinga=>tracingb",
         ext.RPCServerOption(wireContext))
     defer serverSpan.Finish()
     serverSpan = serverSpan.SetOperationName("tracinga=>tracingb")
     serverSpan = serverSpan.SetTag("kind", "server")

     sp := opentracing.StartSpan(
         "tracingb=>tracingc",
         opentracing.ChildOf(wireContext))
     defer sp.Finish()
     sp = sp.SetOperationName("tracingb=>tracingc")
     sp = sp.SetTag("kind", "client")

     req, err := http.NewRequest("GET", "http://127.0.0.1:9802/tracingc/c", nil)
     if err != nil {
         fmt.Printf("%s\n", err)
         return
     }

     // Transmit the span's TraceContext as HTTP headers on our
     // outbound request.
     opentracing.GlobalTracer().Inject(
         sp.Context(),
         opentracing.HTTPHeaders,
         opentracing.HTTPHeadersCarrier(req.Header))

     client := &http.Client{}
     resp, err := client.Do(req)

Sampling

envoy

type HttpConnectionManager_Tracing struct {  
    // The span name will be derived from this field.
    OperationName HttpConnectionManager_Tracing_OperationName `protobuf:"varint,1,opt,name=operation_name,json=operationName,proto3,enum=envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager_Tracing_OperationName" json:"operation_name,omitempty"`
    // A list of header names used to create tags for the active span. The header name is used to
    // populate the tag name, and the header value is used to populate the tag value. The tag is
    // created if the specified header name is present in the request's headers.
    RequestHeadersForTags []string `protobuf:"bytes,2,rep,name=request_headers_for_tags,json=requestHeadersForTags,proto3" json:"request_headers_for_tags,omitempty"`
    // Target percentage of requests managed by this HTTP connection manager that will be force
    // traced if the :ref:`x-client-trace-id <config_http_conn_man_headers_x-client-trace-id>`
    // header is set. This field is a direct analog for the runtime variable
    // 'tracing.client_sampling' in the :ref:`HTTP Connection Manager
    // <config_http_conn_man_runtime>`.
    // Default: 100%
    ClientSampling *_type.Percent `protobuf:"bytes,3,opt,name=client_sampling,json=clientSampling,proto3" json:"client_sampling,omitempty"`
    // Target percentage of requests managed by this HTTP connection manager that will be randomly
    // selected for trace generation, if not requested by the client or not forced. This field is
    // a direct analog for the runtime variable 'tracing.random_sampling' in the
    // :ref:`HTTP Connection Manager <config_http_conn_man_runtime>`.
    // Default: 100%
    RandomSampling *_type.Percent `protobuf:"bytes,4,opt,name=random_sampling,json=randomSampling,proto3" json:"random_sampling,omitempty"`
    // Target percentage of requests managed by this HTTP connection manager that will be traced
    // after all other sampling checks have been applied (client-directed, force tracing, random
    // sampling). This field functions as an upper limit on the total configured sampling rate. For
    // instance, setting client_sampling to 100% but overall_sampling to 1% will result in only 1%
    // of client requests with the appropriate headers to be force traced. This field is a direct
    // analog for the runtime variable 'tracing.global_enabled' in the
    // :ref:`HTTP Connection Manager <config_http_conn_man_runtime>`.
    // Default: 100%
    OverallSampling      *_type.Percent `protobuf:"bytes,5,opt,name=overall_sampling,json=overallSampling,proto3" json:"overall_sampling,omitempty"`
    XXX_NoUnkeyedLiteral struct{}       `json:"-"`
    XXX_unrecognized     []byte         `json:"-"`
    XXX_sizecache        int32          `json:"-"`
}


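Note that the sampling decision for a trace is made at the first reporting point.
Settings on the hops in between have no effect, unless you manually change the x-b3-sampled header (0 or 1).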
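
The rule can be sketched in a few lines of Go. This is only an illustration of the behavior described here, not Envoy's or Jaeger's actual code; decideSampled and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// decideSampled returns the sampling flag a hop should use.
// upstream is the incoming x-b3-sampled value ("" when the header is absent),
// localRate is this hop's own sampling probability in [0, 1],
// and rnd supplies a random number in [0, 1).
func decideSampled(upstream string, localRate float64, rnd func() float64) bool {
	switch upstream {
	case "1":
		return true // decided upstream: keep sampling, local rate ignored
	case "0":
		return false // decided upstream: stay unsampled, local rate ignored
	}
	// No upstream decision: this hop is the first reporting point.
	return rnd() < localRate
}

func main() {
	fmt.Println(decideSampled("0", 1.0, rand.Float64)) // false
	fmt.Println(decideSampled("1", 0.0, rand.Float64)) // true
	fmt.Println(decideSampled("", 1.0, rand.Float64))  // true
}
```

The first two calls show why a 100% or 0% rate on a middle hop changes nothing once the header is present.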

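Because Envoy does not support configuring tracing for an individual egress service, and our front-envoy is itself an egress gateway, we removed tracing from front-envoy for now.

We then found a bug: when a tracing setting is zero-valued, Envoy never receives it through ADS; it only works when configured directly in envoy.yaml.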

bug

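The configuration above only covers the front Envoy and the local Envoy.
Actual calls between services go over HTTP or gRPC.

Our team uses a service mesh and pushes configuration down to each service proxy,
but the Envoy SDK has a bug: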

import (  
        ...
    http_conn_manager "github.com/envoyproxy/go-control-plane/envoy/config/filter/network/http_connection_manager/v2"
        ...
)
    listenFilterHttpConn.Tracing = &http_conn_manager.HttpConnectionManager_Tracing{
        OperationName:  http_conn_manager.INGRESS, 
        RandomSampling: &_type.Percent{Value: 1.0},
    }

    listenFilterHttpConnConv, err := util.MessageToStruct(listenFilterHttpConn)


http_conn_manager.INGRESS is a constant whose value is 0.
The conversion goes through JSON serialization, so the generated configuration ends up as:

    tracing:
        random_sampling: 1.0

It should normally look like this:

    tracing:
        operation_name: ingress
        random_sampling: 1.0
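
The effect is easy to reproduce with the standard library alone. The sketch below uses encoding/json as a stand-in for the SDK's conversion step, and the types operationName and tracing are hypothetical simplifications of the generated struct; the point is only that a field tagged omitempty disappears when it holds its zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operationName mimics a protobuf enum whose first value is 0.
type operationName int32

const (
	ingress operationName = 0
	egress  operationName = 1
)

// tracing mirrors the shape of the generated struct: every field carries omitempty.
type tracing struct {
	OperationName  operationName `json:"operation_name,omitempty"`
	RandomSampling float64       `json:"random_sampling,omitempty"`
}

// encode marshals t and returns the JSON text.
func encode(t tracing) string {
	b, err := json.Marshal(t)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// ingress is 0, so the field is dropped from the output.
	fmt.Println(encode(tracing{OperationName: ingress, RandomSampling: 1}))
	// {"random_sampling":1}

	// egress is 1, so the field survives.
	fmt.Println(encode(tracing{OperationName: egress, RandomSampling: 1}))
	// {"operation_name":1,"random_sampling":1}
}
```

Any receiver that cannot tell "absent" from "zero" will then see a tracing block without an operation name.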


The issue I filed

Code-intrusive parts

gin middleware

package gin_middleware

import (  
        "fmt"
        "github.com/gin-gonic/gin"
        "github.com/opentracing/basictracer-go"
        "github.com/opentracing/opentracing-go"
        "github.com/opentracing/opentracing-go/ext"   
        "strconv"
)

func HttpResponseTrace() gin.HandlerFunc {  
    return func(ctx *gin.Context) {
        endPoint := ctx.Request.URL.Path

        var serverSpan opentracing.Span
        wireContext, err := opentracing.GlobalTracer().Extract(
            opentracing.HTTPHeaders,
            opentracing.HTTPHeadersCarrier(ctx.Request.Header))
        if err != nil {
            // If extraction fails, the upstream is not traced; just ignore it
            log.Logger.Debug(fmt.Sprintf("[tracing]please ignore this error if you don't care about tracing, extract %s failed", err.Error()))
            err = nil 
        }   

        serverSpan = opentracing.StartSpan(
            endPoint,
            ext.RPCServerOption(wireContext))
        defer serverSpan.Finish()
        serverSpan = serverSpan.SetTag("kind", "server")

        if sc, ok := serverSpan.Context().(basictracer.SpanContext); ok {
            ctx.Writer.Header().Set("X-B3-TraceId", strconv.FormatUint(sc.TraceID, 10))
        }   

        ctx.Set("CUR_REQ_SPAN_STACK", []*opentracing.Span{&serverSpan})
        ctx.Next()

        if relativePath := ctx.GetString("RELATIVE_PATH"); relativePath != "" {
            endPoint = relativePath
        }   
        serverSpan = serverSpan.SetOperationName(endPoint)

        statusCode := ctx.Writer.Status()
        comment := ctx.Errors.ByType(gin.ErrorTypePrivate).String()

        if statusCode >= 500 {
            serverSpan.LogKV("error", fmt.Errorf("%s", comment))
        }   
    }   
}
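
One detail worth checking in the middleware is the encoding of the X-B3-TraceId response header. B3 headers carry trace IDs as 16 or 32 lower-case hex characters, and the Jaeger UI also displays trace IDs in hex, so if the header is meant to be pasted into the UI, a hex rendering may be more convenient than base 10. A minimal sketch, assuming a 64-bit trace ID; formatTraceID is a hypothetical helper, not part of any library:

```go
package main

import "fmt"

// formatTraceID renders a 64-bit trace ID as 16 lower-case hex characters,
// zero-padded on the left.
func formatTraceID(id uint64) string {
	return fmt.Sprintf("%016x", id)
}

func main() {
	fmt.Println(formatTraceID(255))                // 00000000000000ff
	fmt.Println(formatTraceID(0x463ac35c9f6413ad)) // 463ac35c9f6413ad
}
```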

Redis client, with the span passed along through the gin middleware context

// config: optional name of the current operation; omitted by default
func TraceWrapRedisClient(ctx *gin.Context, c *redis.Client, config ...string) *redis.Client {
    if ctx == nil {
        return c
    }
    var (
        spanStackInterface interface{}
        flag               bool
        spanStacks         []*opentracing.Span
        parentSpan         *opentracing.Span
    )
    if spanStackInterface, flag = ctx.Get("CUR_REQ_SPAN_STACK"); !flag {
        return c
    }
    // Fall back to the untraced client when the stack has the wrong type or is empty.
    if spanStacks, flag = spanStackInterface.([]*opentracing.Span); !flag || len(spanStacks) <= 0 {
        return c
    }

    parentSpan = spanStacks[len(spanStacks)-1]

    traced := c.WithContext(c.Context())
    traced.WrapProcess(func(oldProcess func(cmd redis.Cmder) error) func(cmd redis.Cmder) error {
        return func(cmd redis.Cmder) error {
            tr := (*parentSpan).Tracer()
            sp := tr.StartSpan("redis", opentracing.ChildOf((*parentSpan).Context()))
            defer sp.Finish()
            ext.DBType.Set(sp, "redis")
            sp.SetTag("db.method", cmd.Name())
            if len(config) > 0 {
                sp.SetTag("db.opername", config[0])
            }
            // Keep the error in scope so it is returned to the caller instead of being swallowed.
            err := oldProcess(cmd)
            if err != nil {
                sp.LogKV("error", err.Error())
            }
            return err
        }
    })
    return traced
}
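
Execution-block style

At the point in the code you want to trace:

TraceWrapExecBlockStart(ctx, name)
// the closure reads err when the block exits, not when defer is declared
defer func() { TraceWrapExecBlockEnd(ctx, err) }()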

// ctx: the gin ctx of the current HTTP request
func TraceWrapExecBlockStart(ctx *gin.Context, name string) {  
    var (
        spanStackInterface interface{}
        flag               bool
        spanStacks         []*opentracing.Span
        parentSpan         *opentracing.Span
    )
    if spanStackInterface, flag = ctx.Get("CUR_REQ_SPAN_STACK"); !flag {
        return
    }
    if spanStacks, flag = spanStackInterface.([]*opentracing.Span); flag == false || len(spanStacks) <= 0 {
        return
    }
    parentSpan = spanStacks[len(spanStacks)-1]
    sp := opentracing.StartSpan(name, opentracing.ChildOf((*parentSpan).Context()))
    spanStacks = append(spanStacks, &sp)
    ctx.Set("CUR_REQ_SPAN_STACK", spanStacks)
    return
}

// ctx: the gin ctx of the current HTTP request
// err: the error for the current span; pass nil if there is none
func TraceWrapExecBlockEnd(ctx *gin.Context, err error) {  
    var (
        spanStackInterface interface{}
        flag               bool
        spanStacks         []*opentracing.Span
        curSpan            *opentracing.Span
    )

    if spanStackInterface, flag = ctx.Get("CUR_REQ_SPAN_STACK"); !flag {
        return
    }
    if spanStacks, flag = spanStackInterface.([]*opentracing.Span); flag == false || len(spanStacks) <= 0 {
        return
    }
    curSpan = spanStacks[len(spanStacks)-1]
    spanStacks = spanStacks[0 : len(spanStacks)-1]
    ctx.Set("CUR_REQ_SPAN_STACK", spanStacks)
    if err != nil {
        (*curSpan).LogKV("error", err.Error())
    }
    (*curSpan).Finish()
    return
}
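
How the start/end pair nests is easier to see without gin or opentracing in the way. The sketch below is a simplified stand-in, assuming sequential code in a single goroutine; span, spanStack and handle are hypothetical names, and the stack plays the role of the slice stored under CUR_REQ_SPAN_STACK.

```go
package main

import "fmt"

// span is a stand-in for opentracing.Span in this sketch.
type span struct {
	name   string
	parent string
}

// spanStack mimics the slice stored under CUR_REQ_SPAN_STACK.
type spanStack struct {
	spans []*span
	log   []string // finished spans in completion order, as "name<-parent"
}

// start pushes a child of the span currently on top of the stack.
func (s *spanStack) start(name string) {
	if len(s.spans) == 0 {
		return // no request span: tracing is off for this request
	}
	parent := s.spans[len(s.spans)-1]
	s.spans = append(s.spans, &span{name: name, parent: parent.name})
}

// end pops the top span and records it as finished.
func (s *spanStack) end() {
	if len(s.spans) == 0 {
		return
	}
	top := s.spans[len(s.spans)-1]
	s.spans = s.spans[:len(s.spans)-1]
	s.log = append(s.log, top.name+"<-"+top.parent)
}

// handle shows the intended pairing: start, then defer end, in each traced block.
func handle(s *spanStack) {
	s.start("load-user")
	defer s.end()

	func() {
		s.start("query-db")
		defer s.end()
	}()
}

func main() {
	s := &spanStack{spans: []*span{{name: "GET /user"}}}
	handle(s)
	fmt.Println(s.log)        // [query-db<-load-user load-user<-GET /user]
	fmt.Println(len(s.spans)) // 1
}
```

Inner blocks finish first and each one is parented to whatever was on top of the stack when it started. The pattern relies on every start being matched by exactly one end; blocks running in parallel goroutines would need a different structure.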

Dependency graph

The dependency graph requires installing a scheduled job.