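# [Production Debug Diary] A K8s Service Restart Caused by Concurrent Go Map Access
# Background
We recently received a Grafana alert reporting that a pod of one of our K8s services had restarted. The output of `kubectl describe` for that pod was as follows: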
$ kubectl describe pod xxxxxx-api-6577f45dd5-9gprq
Name: xxxxxx-api-6577f45dd5-9gprq
Namespace: xxxxxx
Priority: 0
Node: 10.0.11.15/10.0.11.15
Start Time: Tue, 24 Dec 2024 16:23:07 +0800
Labels: pod-template-hash=6577f45dd5
app=xxxxxx-api
Annotations: <none>
Status: Running
IP: 10.230.83.56
IPs:
IP: 10.230.83.56
Controlled By: ReplicaSet/xxxxxx-api-6577f45dd5
Containers:
xxxxxx-api:
Container ID: docker://40ad476607e55cec892f495ac668ad8e30dc6e3f8c4e80a50c00ddc926f5e918
Image: harbor.k8s.com/fanli_xxxxxx/xxxxxx_api:v0.0.0-20241224161914
Image ID: docker-pullable://harbor.k8s.com/xxxxxx/xxxxxx_api@sha256:b04d274d0c448ed1159d1ea341fcd8ec3480016c260ac7b151eea607d0da9458
Port: 8888/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 25 Dec 2024 08:44:18 +0800
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Wed, 25 Dec 2024 05:12:12 +0800
Finished: Wed, 25 Dec 2024 08:44:16 +0800
Ready: True
Restart Count: 4
Limits:
cpu: 4
Requests:
cpu: 500m
memory: 1048Mi
Liveness: http-get http://:8888/healthcheck delay=5s timeout=3s period=45s #success=1 #failure=3
Readiness: http-get http://:8888/healthcheck delay=5s timeout=1s period=10s #success=3 #failure=3
Environment:
MY_POD_NAME: xxxxxx-api-6577f45dd5-9gprq (v1:metadata.name)
Mounts:
/data/applogs from xxxxxxwebdata-log (rw)
/data/weblogs from xxxxxxwebdata-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-zd7z8 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
xxxxxxwebdata-log:
Type: HostPath (bare host directory volume)
Path: /xxxxxx/logs/
HostPathType: DirectoryOrCreate
default-token-zd7z8:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-zd7z8
Optional: false
QoS Class: Burstable
Node-Selectors: type=Physical-machine
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
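The description shows that the container has restarted 4 times and that the most recent termination had exit code 2, which means the program exited abnormally (a Go program exits with code 2 when it dies from an unrecovered panic or a runtime fatal error). Next, we look at the logs of the previous container instance for more detail.
$ kubectl logs xxxxxx-api-6577f45dd5-9gprq --previous
The log output is long; the key error is: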
fatal error: concurrent map iteration and map write
goroutine 162439 [running]:
github.com/dtapps/go-library/utils/gorequest.(*Params).DeepCopy(...)
/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/utils/gorequest/params.go:66
github.com/dtapps/go-library/utils/gorequest.request(0xc0004204e0, {0x3111ee0, 0x481d920})
/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/utils/gorequest/http.go:195 +0x19a
github.com/dtapps/go-library/utils/gorequest.(*App).Get(0xc000562710?, {0x3111ee0?, 0x481d920?}, {0x0?, 0xc0005627a0?, 0x411c5b?})
/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/utils/gorequest/http.go:170 +0xd8
github.com/dtapps/go-library/service/pinduoduo.(*Client).request(0xc0004369c0, {0x3111ee0, 0x481d920}, 0xc0010c0d80)
/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/service/pinduoduo/request.go:21 +0x197
github.com/dtapps/go-library/service/pinduoduo.(*Client).GoodsDetail(0xc0004369c0, {0x3111ee0, 0x481d920}, {0xc000562d90?, 0x1?, 0x1?})
/go/pkg/mod/github.com/dtapps/go-library@v1.0.157/service/pinduoduo/pdd.ddk.goods.detail.go:106 +0x11c
gitea.xxxxxx.com/goweb/fsdk-go/fsdkunion.(*PddClient).PddDdkGoodsDetail(0xc000211620?, 0x0, {0xc0014bd1d0, 0x24}, 0xc0010c0d50)
/go/pkg/mod/gitea.xxxxxx.com/goweb/fsdk-go@v0.0.0-20241221060227-0ed7d0a9c9ac/fsdkunion/pdd.go:69 +0x96
gitea.xxxxxx.com/goweb/xxxxxx/api/internal/service.(*Service).GetUnionPddItemDetail(0xc000783500, 0xc0010c0d50?, {0xc0014bd1d0?, 0xa?}, 0xc000514120?)
/builder/api/internal/service/unionpddservice.go:143 +0x8b
gitea.xxxxxx.com/goweb/xxxxxx/api/internal/service.(*Service).GetUnionPddItemsByGoodsSignList.func1({0xc0011eb1a0?, 0xc0016c68c0?, 0xc00061b7a0?})
/builder/api/internal/service/unionpddservice.go:96 +0x179
gitea.xxxxxx.com/goweb/fsdk-go/fsdktype.(*SafeGo).Go.func1({0x0?, 0x72aea5?, 0xc000ef5d40?})
/go/pkg/mod/gitea.xxxxxx.com/goweb/fsdk-go@v0.0.0-20241221060227-0ed7d0a9c9ac/fsdktype/safego.go:36 +0x96
created by gitea.xxxxxx.com/goweb/fsdk-go/fsdktype.(*SafeGo).Go in goroutine 162413
/go/pkg/mod/gitea.xxxxxx.com/goweb/fsdk-go@v0.0.0-20241221060227-0ed7d0a9c9ac/fsdktype/safego.go:24 +0xe7
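The log shows why the service exited: a map was iterated and written concurrently. The Go runtime reports this as a fatal error rather than a panic, so a deferred `recover()` cannot intercept it and the whole process terminates. The error occurred in `github.com/dtapps/go-library/utils/gorequest.(*Params).DeepCopy`, which is defined as follows: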
// DeepCopy makes a deep copy
func (p *Params) DeepCopy() map[string]interface{} {
	targetMap := make(map[string]interface{})
	// Copy from the original into the target
	for key, value := range *p {
		targetMap[key] = value
	}
	// Allocate a fresh map for the original
	*p = map[string]interface{}{}
	return targetMap
}
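To see what this method does in isolation, the following minimal sketch runs it on a single goroutine. The `Params` type and `DeepCopy` body follow the library code quoted above; `copyAndInspect` and the sample keys are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// Params follows the library's map type for request parameters.
type Params map[string]interface{}

// DeepCopy follows the library method quoted above: it copies the
// top-level entries into a new map and then resets the receiver.
func (p *Params) DeepCopy() map[string]interface{} {
	targetMap := make(map[string]interface{})
	for key, value := range *p {
		targetMap[key] = value
	}
	*p = map[string]interface{}{}
	return targetMap
}

// copyAndInspect is a hypothetical helper: it fills a Params value,
// calls DeepCopy, and reports the size of the copy and of the original.
func copyAndInspect() (copied, remaining int) {
	p := Params{"goods_sign": "abc", "pid": "123"}
	out := p.DeepCopy()
	return len(out), len(p)
}

func main() {
	copied, remaining := copyAndInspect()
	fmt.Println(copied, remaining) // prints "2 0"
}
```

The call both reads the receiver (the `range` loop) and writes it (the reassignment), so the method mutates shared state even though its name suggests a read-only copy.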
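As the code shows, `DeepCopy` copies the entries of a `Params` value into a new map and then resets the original to an empty map. The only data that can be iterated and written concurrently here is `*p`: the `range` loop reads it, so another goroutine must have been writing to the same map at the same time. Next, we trace the definition of the `Params` type and where it is used.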
// Params holds the request parameters
type Params map[string]interface{}
// App instance
type App struct {
	Uri                          string           // global request URL, used only when no url is set
	Error                        error            // error
	httpUri                      string           // request URL
	httpMethod                   string           // request method
	httpHeader                   Headers          // request headers
	httpParams                   Params           // request parameters
	httpCookie                   string           // Cookie
	responseContent              Response         // response content
	httpContentType              string           // request content type
	debug                        bool             // whether debug mode is enabled
	p12Cert                      *tls.Certificate // p12 certificate content
	tlsMinVersion, tlsMaxVersion uint16           // TLS versions
	config                       struct {
		systemOs     string // operating system
		systemKernel string // system kernel
		goVersion    string // Go version
		sdkVersion   string // SDK version
	}
}
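At this point we can see that `Params` is the type of a field (`httpParams`) of the `App` struct, which is defined in the `github.com/dtapps/go-library/utils/gorequest` package. Next, we trace where the `App` type is used.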
// Client instance
type Client struct {
	requestClient *gorequest.App // request service
	config        struct {
		clientId     string // client_id assigned to the application by POP
		clientSecret string // client_secret assigned to the application by POP
		mediaId      string // media ID
		pid          string // promotion position ID
	}
	zap struct {
		status bool             // status
		client *golog.ApiZapLog // logging service
	}
}
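The code shows that `App` is in turn held as a field (`requestClient`) of the `Client` struct, which is defined in the `github.com/dtapps/go-library/service/pinduoduo` package. Next, we trace where the `Client` type is used.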
// Pinduoduo client
type PddClient struct {
	AppKey    string
	AppSecret string
	MediaId   string
	Pid       string
	client    *pinduoduo.Client
}
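Now the cause is clear: when defining the Pinduoduo client in our service, we instantiated only a single `pinduoduo.Client`. The stack trace shows one source of concurrency: `GetUnionPddItemsByGoodsSignList` starts goroutines through `SafeGo.Go`, and each of them calls `GetUnionPddItemDetail`. When several of these calls run at the same time, they all go through the same client and therefore the same `Params` map, which is then read and written concurrently, and the program crashes.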
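# Solution
In our own `PddClient`, we now store only the Pinduoduo `AppKey`, `AppSecret`, `MediaId`, and `Pid`, and no longer hold a `pinduoduo.Client`. A new `pinduoduo.Client` is instantiated for every request, so no `Params` value is shared between goroutines and the concurrent read and write can no longer occur.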
// Pinduoduo client
type PddClient struct {
	AppKey    string
	AppSecret string
	MediaId   string
	Pid       string
}
// Client returns a new pinduoduo.Client for each call
func (p *PddClient) Client() *pinduoduo.Client {
	return pinduoduo.NewClient(p.AppKey, p.AppSecret, p.MediaId, p.Pid)
}