# Tracing Solution

Our basic solution is built on a service mesh (Envoy, although in the end we did not rely on Envoy-side reporting, as described below) and Jaeger.
## Environment setup

Setting up Jaeger mainly follows the official jaeger-k8s documentation. Pay particular attention to the distinction between the components and their ports:
| Port  | Protocol | Component | Purpose |
|-------|----------|-----------|---------|
| 5775  | UDP  | agent     | accepts zipkin.thrift over the Thrift compact protocol |
| 6831  | UDP  | agent     | accepts jaeger.thrift over the Thrift compact protocol |
| 6832  | UDP  | agent     | accepts jaeger.thrift over the Thrift binary protocol |
| 5778  | HTTP | agent     | serves configuration |
| 16686 | HTTP | query     | serves the web UI |
| 14268 | HTTP | collector | accepts jaeger.thrift directly from clients |
| 14250 | gRPC | collector | accepts spans in model.proto format |
| 9411  | HTTP | collector | Zipkin-compatible HTTP endpoint |
The agent and collector are deployed in the shared-services Kubernetes environment (a01), while the Jaeger web UI can live in the management environment (a02).

For the storage backend we did not use Cassandra; we use Elasticsearch instead (a bare-metal ES cluster, shared with the ES cluster used for critical-event reporting).
### Test environment

The complete one-click installation is only suitable for testing:
```shell
docker run -d -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p5775:5775/udp \
  -p6831:6831/udp \
  -p6832:6832/udp \
  -p5778:5778 \
  -p16686:16686 \
  -p14268:14268 \
  -p9411:9411 \
  jaegertracing/all-in-one:latest
```
### Production environment
https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/master/production-elasticsearch/configmap.yml
https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/master/production-elasticsearch/elasticsearch.yml
The official project provides the two YAML files above. We made slight modifications before using them in our environment: the agent's DaemonSet was changed to a Deployment, and the ES service was changed to a real IP plus port address.
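A rough sketch of those two changes, with illustrative names and a placeholder ES address (not our exact manifests; the agent flag name follows recent jaeger-agent releases, so adjust to your version):

```yaml
# Agent as a Deployment instead of the official DaemonSet
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jaeger-agent
  template:
    metadata:
      labels:
        app: jaeger-agent
    spec:
      containers:
        - name: jaeger-agent
          image: jaegertracing/jaeger-agent
          args: ["--reporter.grpc.host-port=jaeger-collector:14250"]
---
# Selector-less Service plus manual Endpoints, pointing "elasticsearch"
# at the bare-metal cluster's real IP and port
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
spec:
  ports:
    - port: 9200
---
apiVersion: v1
kind: Endpoints
metadata:
  name: elasticsearch   # must match the Service name
subsets:
  - addresses:
      - ip: 10.1.2.3    # placeholder: real ES node IP
    ports:
      - port: 9200
```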
## Usage

### Reporting via the sidecar proxy

We turned on the tracing switch in each service's local envoy proxy (this approach was eventually abandoned):
```yaml
tracing:
  http:
    name: envoy.zipkin
    config:
      collector_cluster: jaeger
      collector_endpoint: "/api/v1/spans"
# ......
static_resources:
  clusters:
  - name: jaeger
    connect_timeout: 0.25s
    type: strict_dns
    lb_policy: round_robin
    hosts:
    - socket_address:
        address: jaeger-collector
        port_value: 9411
```
Here the cluster named jaeger is the address of Jaeger; jaeger-collector must be reachable from inside the Kubernetes cluster (the port, 9411, tells you this is the Zipkin-compatible HTTP endpoint).
### Calls between services

An egress configuration needs to be added, since calls between services go through the outbound path. The official example configuration:
```yaml
listeners:
  # .....
  tracing:
    operation_name: egress
```
### Ingress

The front envoy proxy additionally needs `generate_request_id: true` so that a unique request id is generated.

As before, we push this configuration down to each service proxy through the service mesh.
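A minimal sketch of where that flag sits in the listener's http_connection_manager filter (field names per Envoy's v2 API; the tracing block is abbreviated):

```yaml
- name: envoy.http_connection_manager
  config:
    generate_request_id: true
    tracing:
      operation_name: ingress
```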
### Storage calls

This part may require development on our side to report spans manually: although Envoy can proxy MySQL and Redis, it does not support tracing them, so calls to e.g. MySQL and Redis need manual instrumentation.
### Direct reporting
```go
// main.go
package main

import (
	"log"
	"time"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	cfg, err := jaegercfg.FromEnv()
	if err != nil {
		// parsing errors might happen here, such as when we get a string where we expect a number
		log.Printf("Could not parse Jaeger env vars: %s", err.Error())
		return
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Printf("Could not initialize jaeger tracer: %s", err.Error())
		return
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	// continue main()
	span := opentracing.StartSpan("test_chainhelen")
	ext.SamplingPriority.Set(span, 1)
	defer span.Finish()

	time.Sleep(2 * time.Second)
	log.Printf("main...\n")
}
```
```shell
export JAEGER_SERVICE_NAME=chainhelen_service
export JAEGER_REPORTER_LOG_SPANS=true
export JAEGER_ENDPOINT=http://127.0.0.1:14268/api/traces

go build -o main main.go
./main
```
Note that Envoy samples 100% by default. In the client code, however, if you do not set SamplingPriority, the effective sampling rate defaults to 0, i.e. nothing gets sampled, which is why at first we saw no data at all.
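As an alternative to forcing SamplingPriority in code, jaegercfg.FromEnv also reads sampler settings from the environment; a constant sampler with parameter 1 samples everything:

```shell
# Constant sampler, sample every trace (read by jaegercfg.FromEnv)
export JAEGER_SAMPLER_TYPE=const
export JAEGER_SAMPLER_PARAM=1
```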
### Calls between services: parsing headers

There is a protocol-compatibility problem, so we standardized on the Zipkin-compatible B3 headers. If your code follows the jaeger-client-go example as-is, it reports an error:
```go
var serverSpan opentracing.Span
wireContext, err := opentracing.GlobalTracer().Extract(
	opentracing.HTTPHeaders,
	opentracing.HTTPHeadersCarrier(r.Header))
if err != nil {
	fmt.Printf("Error %s", err.Error())
	return
}
```
```
Error opentracing: SpanContext not found in Extract carrier
x-request-id: 5aeda785-786d-4ca1-8362-0eeb72b8f70a
```
The official documentation also mentions the http-b3-compatible-header option; the example to follow is NewZipkinB3HTTPHeaderPropagator:
```go
// Recommended configuration for production.
cfg := jaegercfg.Configuration{}

// Example logger and metrics factory. Use github.com/uber/jaeger-client-go/log
// and github.com/uber/jaeger-lib/metrics respectively to bind to real logging and metrics
// frameworks.
jLogger := jaegerlog.StdLogger
jMetricsFactory := metrics.NullFactory

// Zipkin shares span ID between client and server spans; it must be enabled via the following option.
// zipkin here is github.com/uber/jaeger-client-go/zipkin.
zipkinPropagator := zipkin.NewZipkinB3HTTPHeaderPropagator()

// Create tracer and then initialize global tracer
closer, err := cfg.InitGlobalTracer(
	serviceName,
	jaegercfg.Logger(jLogger),
	jaegercfg.Metrics(jMetricsFactory),
	jaegercfg.Injector(opentracing.HTTPHeaders, zipkinPropagator),
	jaegercfg.Extractor(opentracing.HTTPHeaders, zipkinPropagator),
	jaegercfg.ZipkinSharedRPCSpan(true),
)
if err != nil {
	log.Printf("Could not initialize jaeger tracer: %s", err.Error())
	return
}
defer closer.Close()
```
The final service-to-service forwarding code:
```go
var serverSpan opentracing.Span
wireContext, err := opentracing.GlobalTracer().Extract(
	opentracing.HTTPHeaders,
	opentracing.HTTPHeadersCarrier(r.Header))
if err != nil {
	fmt.Printf("Error %s", err.Error())
	return
}

// Create the span referring to the RPC client if available.
// If wireContext == nil, a root span will be created.
serverSpan = opentracing.StartSpan(
	"tracinga=>tracingb",
	ext.RPCServerOption(wireContext))
defer serverSpan.Finish()
serverSpan = serverSpan.SetOperationName("tracinga=>tracingb")
serverSpan = serverSpan.SetTag("kind", "server")

sp := opentracing.StartSpan(
	"tracingb=>tracingc",
	opentracing.ChildOf(wireContext))
defer sp.Finish()
sp = sp.SetOperationName("tracingb=>tracingc")
sp = sp.SetTag("kind", "client")

req, err := http.NewRequest("GET", "http://127.0.0.1:9802/tracingc/c", nil)
if err != nil {
	fmt.Printf("%s\n", err)
	return
}
// Transmit the span's TraceContext as HTTP headers on our
// outbound request.
opentracing.GlobalTracer().Inject(
	sp.Context(),
	opentracing.HTTPHeaders,
	opentracing.HTTPHeadersCarrier(req.Header))
client := &http.Client{}
resp, err := client.Do(req)
```
## Sampling

Envoy's tracing configuration struct:
```go
type HttpConnectionManager_Tracing struct {
	// The span name will be derived from this field.
	OperationName HttpConnectionManager_Tracing_OperationName `protobuf:"varint,1,opt,name=operation_name,json=operationName,proto3,enum=envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager_Tracing_OperationName" json:"operation_name,omitempty"`
	// A list of header names used to create tags for the active span. The header name is used to
	// populate the tag name, and the header value is used to populate the tag value. The tag is
	// created if the specified header name is present in the request's headers.
	RequestHeadersForTags []string `protobuf:"bytes,2,rep,name=request_headers_for_tags,json=requestHeadersForTags,proto3" json:"request_headers_for_tags,omitempty"`
	// Target percentage of requests managed by this HTTP connection manager that will be force
	// traced if the :ref:`x-client-trace-id <config_http_conn_man_headers_x-client-trace-id>`
	// header is set. This field is a direct analog for the runtime variable
	// 'tracing.client_sampling' in the :ref:`HTTP Connection Manager
	// <config_http_conn_man_runtime>`.
	// Default: 100%
	ClientSampling *_type.Percent `protobuf:"bytes,3,opt,name=client_sampling,json=clientSampling,proto3" json:"client_sampling,omitempty"`
	// Target percentage of requests managed by this HTTP connection manager that will be randomly
	// selected for trace generation, if not requested by the client or not forced. This field is
	// a direct analog for the runtime variable 'tracing.random_sampling' in the
	// :ref:`HTTP Connection Manager <config_http_conn_man_runtime>`.
	// Default: 100%
	RandomSampling *_type.Percent `protobuf:"bytes,4,opt,name=random_sampling,json=randomSampling,proto3" json:"random_sampling,omitempty"`
	// Target percentage of requests managed by this HTTP connection manager that will be traced
	// after all other sampling checks have been applied (client-directed, force tracing, random
	// sampling). This field functions as an upper limit on the total configured sampling rate. For
	// instance, setting client_sampling to 100% but overall_sampling to 1% will result in only 1%
	// of client requests with the appropriate headers to be force traced. This field is a direct
	// analog for the runtime variable 'tracing.global_enabled' in the
	// :ref:`HTTP Connection Manager <config_http_conn_man_runtime>`.
	// Default: 100%
	OverallSampling *_type.Percent `protobuf:"bytes,5,opt,name=overall_sampling,json=overallSampling,proto3" json:"overall_sampling,omitempty"`

	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_unrecognized     []byte   `json:"-"`
	XXX_sizecache        int32    `json:"-"`
}
```
Note that the first reporting point of a trace decides whether that trace is sampled; sampling settings at intermediate hops have no effect, unless you manually change the x-b3-sampled header (0 or 1).
Since Envoy does not support configuring tracing per service on egress, and our front-envoy is itself an egress gateway, we removed tracing from front-envoy for the time being.

We then discovered a bug: for Envoy, a tracing config whose field is zero-valued cannot be delivered via ADS; it only takes effect when configured directly in envoy.yaml.
## Bug

The configuration above only covers the path from front envoy to local envoy. Actual service-to-service calls go over http or grpc, and the team uses a service mesh to push configuration down to each service proxy. However, Envoy's Go SDK (go-control-plane) has a bug:
```go
import (
	// ...
	http_conn_manager "github.com/envoyproxy/go-control-plane/envoy/config/filter/network/http_connection_manager/v2"
	// ...
)

listenFilterHttpConn.Tracing = &http_conn_manager.HttpConnectionManager_Tracing{
	OperationName:  http_conn_manager.INGRESS,
	RandomSampling: &_type.Percent{Value: 1.0},
}
listenFilterHttpConnConv, err := util.MessageToStruct(listenFilterHttpConn)
```
http_conn_manager.INGRESS is a constant whose underlying value is 0. The conversion goes through JSON serialization, so the zero value gets dropped, and the generated config ends up as:
```yaml
tracing:
  random_sampling: 1.0
```
It should actually look like this:

```yaml
tracing:
  operation_name: ingress
  random_sampling: 1.0
```
The issue I filed.
## Code-intrusive parts

### Gin middleware
```go
package gin_middleware

import (
	"fmt"
	"log"
	"strconv"

	"github.com/gin-gonic/gin"
	"github.com/opentracing/basictracer-go"
	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

func HttpResponseTrace() gin.HandlerFunc {
	return func(ctx *gin.Context) {
		endPoint := ctx.Request.URL.Path
		var serverSpan opentracing.Span
		wireContext, err := opentracing.GlobalTracer().Extract(
			opentracing.HTTPHeaders,
			opentracing.HTTPHeadersCarrier(ctx.Request.Header))
		if err != nil {
			// If extraction fails, the upstream is not instrumented for tracing; just ignore it.
			log.Printf("[tracing] please ignore this error if you don't care about tracing, extract failed: %s", err.Error())
			err = nil
		}
		serverSpan = opentracing.StartSpan(
			endPoint,
			ext.RPCServerOption(wireContext))
		defer serverSpan.Finish()
		serverSpan = serverSpan.SetTag("kind", "server")
		if sc, ok := serverSpan.Context().(basictracer.SpanContext); ok {
			ctx.Writer.Header().Set("X-B3-TraceId", strconv.FormatUint(sc.TraceID, 10))
		}
		ctx.Set("CUR_REQ_SPAN_STACK", []*opentracing.Span{&serverSpan})

		ctx.Next()

		if relativePath := ctx.GetString("RELATIVE_PATH"); relativePath != "" {
			endPoint = relativePath
		}
		serverSpan = serverSpan.SetOperationName(endPoint)
		statusCode := ctx.Writer.Status()
		comment := ctx.Errors.ByType(gin.ErrorTypePrivate).String()
		if statusCode >= 500 {
			serverSpan.LogKV("error", fmt.Errorf("%s", comment))
		}
	}
}
```
The span is passed along via gin's middleware context; wrapping the redis client:
```go
// config: optional name for the current operation; omitted by default.
func TraceWrapRedisClient(ctx *gin.Context, c *redis.Client, config ...string) *redis.Client {
	if ctx == nil {
		return c
	}
	var (
		spanStackInterface interface{}
		flag               bool
		spanStacks         []*opentracing.Span
		parentSpan         *opentracing.Span
	)
	if spanStackInterface, flag = ctx.Get("CUR_REQ_SPAN_STACK"); !flag {
		return c
	}
	// Bail out if the type assertion fails or the stack is empty.
	if spanStacks, flag = spanStackInterface.([]*opentracing.Span); !flag || len(spanStacks) <= 0 {
		return c
	}
	parentSpan = spanStacks[len(spanStacks)-1]
	clone := c.WithContext(c.Context())
	clone.WrapProcess(func(oldProcess func(cmd redis.Cmder) error) func(cmd redis.Cmder) error {
		return func(cmd redis.Cmder) error {
			tr := (*parentSpan).Tracer()
			sp := tr.StartSpan("redis", opentracing.ChildOf((*parentSpan).Context()))
			defer sp.Finish()
			ext.DBType.Set(sp, "redis")
			sp.SetTag("db.method", cmd.Name())
			if len(config) > 0 {
				sp.SetTag("db.opername", config[0])
			}
			err := oldProcess(cmd)
			if err != nil {
				sp.LogKV("error", err.Error())
			}
			return err
		}
	})
	return clone
}
```
### Execution-block style

At the desired place in the code:

```go
TraceWrapExecBlockStart(ctx, name)
defer TraceWrapExecBlockEnd(ctx, err) // note: err is evaluated when the defer statement runs
```
```go
// ctx: the gin ctx of the current http request
func TraceWrapExecBlockStart(ctx *gin.Context, name string) {
	var (
		spanStackInterface interface{}
		flag               bool
		spanStacks         []*opentracing.Span
		parentSpan         *opentracing.Span
	)
	if spanStackInterface, flag = ctx.Get("CUR_REQ_SPAN_STACK"); !flag {
		return
	}
	if spanStacks, flag = spanStackInterface.([]*opentracing.Span); !flag || len(spanStacks) <= 0 {
		return
	}
	parentSpan = spanStacks[len(spanStacks)-1]
	sp := opentracing.StartSpan(name, opentracing.ChildOf((*parentSpan).Context()))
	spanStacks = append(spanStacks, &sp)
	ctx.Set("CUR_REQ_SPAN_STACK", spanStacks)
}

// ctx: the gin ctx of the current http request
// err: the error to record on the current span; pass nil if there is none
func TraceWrapExecBlockEnd(ctx *gin.Context, err error) {
	var (
		spanStackInterface interface{}
		flag               bool
		spanStacks         []*opentracing.Span
		curSpan            *opentracing.Span
	)
	if spanStackInterface, flag = ctx.Get("CUR_REQ_SPAN_STACK"); !flag {
		return
	}
	if spanStacks, flag = spanStackInterface.([]*opentracing.Span); !flag || len(spanStacks) <= 0 {
		return
	}
	curSpan = spanStacks[len(spanStacks)-1]
	spanStacks = spanStacks[0 : len(spanStacks)-1]
	ctx.Set("CUR_REQ_SPAN_STACK", spanStacks)
	if err != nil {
		(*curSpan).LogKV("error", err.Error())
	}
	(*curSpan).Finish()
}
```
## Dependency graph

The dependency graph requires installing a scheduled job.