# [Production Debug Diary] K8S Service Restarts Caused by Concurrent Go Map Operations

# Background

Recently a Grafana alert reported that a pod of one of our K8S services had restarted. Describing the pod showed the following:

# kubectl describe pod xxxxxx-api-6577f45dd5-9gprq
Name:         xxxxxx-api-6577f45dd5-9gprq
Namespace:    xxxxxx
Priority:     0
Node:         10.0.11.15/10.0.11.15
Start Time:   Tue, 24 Dec 2024 16:23:07 +0800
Labels:       pod-template-hash=6577f45dd5
              app=xxxxxx-api
Annotations:  <none>
Status:       Running
IP:           10.230.83.56
IPs:
  IP:           10.230.83.56
Controlled By:  ReplicaSet/xxxxxx-api-6577f45dd5
Containers:
  xxxxxx-api:
    Container ID:   docker://40ad476607e55cec892f495ac668ad8e30dc6e3f8c4e80a50c00ddc926f5e918
    Image:          harbor.k8s.com/fanli_xxxxxx/xxxxxx_api:v0.0.0-20241224161914
    Image ID:       docker-pullable://harbor.k8s.com/xxxxxx/xxxxxx_api@sha256:b04d274d0c448ed1159d1ea341fcd8ec3480016c260ac7b151eea607d0da9458
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 25 Dec 2024 08:44:18 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 25 Dec 2024 05:12:12 +0800
      Finished:     Wed, 25 Dec 2024 08:44:16 +0800
    Ready:          True
    Restart Count:  4
    Limits:
      cpu:  4
    Requests:
      cpu:      500m
      memory:   1048Mi
    Liveness:   http-get http://:8888/healthcheck delay=5s timeout=3s period=45s #success=1 #failure=3
    Readiness:  http-get http://:8888/healthcheck delay=5s timeout=1s period=10s #success=3 #failure=3
    Environment:
      MY_POD_NAME:  xxxxxx-api-6577f45dd5-9gprq (v1:metadata.name)
    Mounts:
      /data/applogs from xxxxxxwebdata-log (rw)
      /data/weblogs from xxxxxxwebdata-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zd7z8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  xxxxxxwebdata-log:
    Type:          HostPath (bare host directory volume)
    Path:          /xxxxxx/logs/
    HostPathType:  DirectoryOrCreate
  default-token-zd7z8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zd7z8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  type=Physical-machine
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

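The description shows that the Pod's container has restarted 4 times, and that the last termination had an exit code of 2, meaning the program exited abnormally. That is the status the Go runtime uses when it aborts on an unrecovered panic or a fatal error. To get more detail, we look at the logs of the previous container instance.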

# kubectl logs xxxxxx-api-6577f45dd5-9gprq --previous

The log output is long; the key error is:

fatal error: concurrent map iteration and map write

goroutine 162439 [running]:
github.com/dtapps/go-library/utils/gorequest.(*Params).DeepCopy(...)
	/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/utils/gorequest/params.go:66
github.com/dtapps/go-library/utils/gorequest.request(0xc0004204e0, {0x3111ee0, 0x481d920})
	/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/utils/gorequest/http.go:195 +0x19a
github.com/dtapps/go-library/utils/gorequest.(*App).Get(0xc000562710?, {0x3111ee0?, 0x481d920?}, {0x0?, 0xc0005627a0?, 0x411c5b?})
	/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/utils/gorequest/http.go:170 +0xd8
github.com/dtapps/go-library/service/pinduoduo.(*Client).request(0xc0004369c0, {0x3111ee0, 0x481d920}, 0xc0010c0d80)
	/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/service/pinduoduo/request.go:21 +0x197
github.com/dtapps/go-library/service/pinduoduo.(*Client).GoodsDetail(0xc0004369c0, {0x3111ee0, 0x481d920}, {0xc000562d90?, 0x1?, 0x1?})
	/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/service/pinduoduo/pdd.ddk.goods.detail.go:106 +0x11c
gitea.xxxxxx.com/goweb/fsdk-go/fsdkunion.(*PddClient).PddDdkGoodsDetail(0xc000211620?, 0x0, {0xc0014bd1d0, 0x24}, 0xc0010c0d50)
	/go/pkg/mod/gitea.xxxxxx.com/goweb/fsdk-go@v0.0.0-20241221060227-0ed7d0a9c9ac/fsdkunion/pdd.go:69 +0x96
gitea.xxxxxx.com/goweb/xxxxxx/api/internal/service.(*Service).GetUnionPddItemDetail(0xc000783500, 0xc0010c0d50?, {0xc0014bd1d0?, 0xa?}, 0xc000514120?)
	/builder/api/internal/service/unionpddservice.go:143 +0x8b
gitea.xxxxxx.com/goweb/xxxxxx/api/internal/service.(*Service).GetUnionPddItemsByGoodsSignList.func1({0xc0011eb1a0?, 0xc0016c68c0?, 0xc00061b7a0?})
	/builder/api/internal/service/unionpddservice.go:96 +0x179
gitea.xxxxxx.com/goweb/fsdk-go/fsdktype.(*SafeGo).Go.func1({0x0?, 0x72aea5?, 0xc000ef5d40?})
	/go/pkg/mod/gitea.xxxxxx.com/goweb/fsdk-go@v0.0.0-20241221060227-0ed7d0a9c9ac/fsdktype/safego.go:36 +0x96
created by gitea.xxxxxx.com/goweb/fsdk-go/fsdktype.(*SafeGo).Go in goroutine 162413
	/go/pkg/mod/gitea.xxxxxx.com/goweb/fsdk-go@v0.0.0-20241221060227-0ed7d0a9c9ac/fsdktype/safego.go:24 +0xe7

The log shows why the service exited: a map was iterated and written concurrently. The failure occurs in github.com/dtapps/go-library/utils/gorequest.(*Params).DeepCopy, whose body is:

// DeepCopy performs a deep copy
func (p *Params) DeepCopy() map[string]interface{} {
	targetMap := make(map[string]interface{})

	// copy from the original into the target
	for key, value := range *p {
		targetMap[key] = value
	}

	// allocate a fresh map for the original
	*p = map[string]interface{}{}
	return targetMap
}

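The code shows that DeepCopy copies the entries of a Params value into a new map and then replaces the original with an empty map. The only map iterated here is *p, so some other goroutine must be writing to that same map while the range loop runs. Next we trace the definition of Params and where it is used.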

// Params holds request parameters
type Params map[string]interface{}

// App is the request client instance
type App struct {
	Uri                          string           // global request URL, used only when no url is set
	Error                        error            // error
	httpUri                      string           // request URL
	httpMethod                   string           // request method
	httpHeader                   Headers          // request headers
	httpParams                   Params           // request parameters
	httpCookie                   string           // Cookie
	responseContent              Response         // response content
	httpContentType              string           // request content type
	debug                        bool             // whether debug mode is enabled
	p12Cert                      *tls.Certificate // p12 certificate content
	tlsMinVersion, tlsMaxVersion uint16           // TLS versions
	config                       struct {
		systemOs     string // operating system type
		systemKernel string // system kernel
		goVersion    string // Go version
		sdkVersion   string // SDK version
	}
}

So Params is the type of the httpParams field of the App struct, which is defined in the github.com/dtapps/go-library/utils/gorequest package. Next we look for where App is referenced.

// Client instance
type Client struct {
	requestClient *gorequest.App // request service
	config        struct {
		clientId     string // client_id assigned to the application by POP
		clientSecret string // client_secret assigned to the application by POP
		mediaId      string // media ID
		pid          string // promotion position ID
	}
	zap struct {
		status bool             // status
		client *golog.ApiZapLog // logging service
	}
}

Here App is held by the requestClient field of the Client struct, which is defined in the github.com/dtapps/go-library/service/pinduoduo package. Next we look for where Client is referenced.

// PddClient is the Pinduoduo client
type PddClient struct {
	AppKey    string
	AppSecret string
	MediaId   string
	Pid       string
	client    *pinduoduo.Client
}

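At this point the cause is clear. When we defined the Pinduoduo client in our service, we instantiated only one pinduoduo.Client. The stack trace shows GetUnionPddItemsByGoodsSignList starting goroutines through SafeGo.Go, each of which calls GetUnionPddItemDetail. When several of these calls run at the same time, they all go through the same gorequest.App and therefore read and write the same Params map concurrently, which crashes the program.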

# Solution

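In our own PddClient we now store only Pinduoduo's AppKey, AppSecret, MediaId, and Pid, and no longer hold a pinduoduo.Client. Each request creates a fresh pinduoduo.Client, so no two requests share a Params map and the concurrent read/write can no longer occur.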

// PddClient is the Pinduoduo client
type PddClient struct {
	AppKey    string
	AppSecret string
	MediaId   string
	Pid       string
}

// Client creates a new pinduoduo.Client for each call
func (p *PddClient) Client() *pinduoduo.Client {
	return pinduoduo.NewClient(p.AppKey, p.AppSecret, p.MediaId, p.Pid)
}