Report on the 0830 website and API instability | 关于0830网站和API不稳定的说明

Starting on August 30, the website and API were unstable.
The incident is now coming to an end.

Cause

Two IP addresses, 35.172.100.150 and 185.100.235.148, sent a large number of requests
to https://api.steemit.com in a short period of time.

They were calling two methods: condenser_api.get_discussions_by_feed and condenser_api.get_discussions_by_author_before_date.

The SQL behind these two methods uses subqueries, and those subqueries perform full table scans.

As a result, their performance is very poor.

Before this incident there were no problems, because these two methods were called infrequently
and some caching measures were already in place.

This time, the rapid batch requests from these two IPs
went straight through the Jussi cache layer, put all the pressure on Hivemind's database,
and brought the database to its knees.

As a result, all services were affected and went down.
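
The report does not say exactly why the requests bypassed the cache. One plausible mechanism, assumed here purely for illustration, is that every request carried different parameters, so no cached entry could be reused. The sketch below uses a hypothetical in-memory cache and a stand-in `slow_query` function (neither is Jussi or Hivemind code) to compare the two traffic patterns:

```python
def make_cached_backend():
    """Return (handle, stats); handle caches responses by (method, params)."""
    cache = {}
    stats = {"db_queries": 0}

    def slow_query(method, params):
        # Stand-in for an expensive SQL query (hypothetical, not Hivemind code).
        stats["db_queries"] += 1
        return f"{method}:{params}"

    def handle(method, params):
        key = (method, params)
        if key not in cache:
            cache[key] = slow_query(method, params)
        return cache[key]

    return handle, stats


METHOD = "condenser_api.get_discussions_by_feed"

# Ordinary traffic: many requests for the same few accounts.
handle, stats = make_cached_backend()
for i in range(1000):
    handle(METHOD, (f"user{i % 5}",))
repeated_hits = stats["db_queries"]

# Batch crawl (assumed pattern): every request has different parameters.
handle, stats = make_cached_backend()
for i in range(1000):
    handle(METHOD, (f"user{i}",))
crawl_hits = stats["db_queries"]

print(repeated_hits, crawl_hits)  # 5 1000
```

Under that assumption the cache absorbs almost all of the ordinary traffic, while the crawl turns every request into a database query.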

Measures taken

Since Jussi does not have an IP filtering function, the fastest solution was to deny the two IPs in an OpenResty layer
temporarily added in front of Jussi. (I do not have permission to access the company's Cloudflare account,
and the company's operations staff could not be reached on short notice.)
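
The actual rule lived in the OpenResty configuration, which is not shown in this report. As a rough illustration of the idea only, here is a minimal Python sketch of a deny check placed in front of an upstream handler. The names `is_denied` and `handle_request` are hypothetical, and rejecting unparsable addresses is a design choice made for this example:

```python
import ipaddress

# The two addresses named in this report.
DENIED = {
    ipaddress.ip_address("35.172.100.150"),
    ipaddress.ip_address("185.100.235.148"),
}


def is_denied(remote_addr: str) -> bool:
    """Return True if the client address is on the deny list."""
    try:
        return ipaddress.ip_address(remote_addr) in DENIED
    except ValueError:
        # Assumption for this sketch: reject addresses that cannot be parsed.
        return True


def handle_request(remote_addr, forward):
    """Answer 403 for denied clients, otherwise call the upstream handler."""
    if is_denied(remote_addr):
        return (403, "Forbidden")
    return forward()


print(handle_request("35.172.100.150", lambda: (200, "ok")))  # (403, 'Forbidden')
print(handle_request("8.8.8.8", lambda: (200, "ok")))         # (200, 'ok')
```

Doing the check before the request reaches the cache or the database means a denied client costs almost nothing to reject.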

It took half a day to configure, test, and get it online.

However, after it went online, the pressure on the database did not decrease.

Rechecking the logs, I found other slow queries containing subqueries,
which were heavily backlogged on the database because of the load from the previous two methods.

Hivemind itself doesn't have a caching module; all caching is done on Jussi.

I had to write a temporary Redis cache module in Hivemind to cache the subqueries' results and improve performance.

[https://github.com/steemit/hivemind/pull/320/commits/c53dfbd2465e0be4b5438e0fc47aed4cf648005f](https://github.com/steemit/hivemind/pull/320/commits/c53dfbd2465e0be4b5438e0fc47aed4cf648005f)
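
The linked commit is the authoritative change. The sketch below is not taken from it and only illustrates the general idea of caching a subquery's result for a short time. It assumes an in-process dictionary in place of Redis so that it runs without extra services. `TTLCache`, `followed_ids`, and the 30-second lifetime are hypothetical:

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds (in-memory stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


calls = []


def followed_ids():
    # Stand-in for a slow subquery (hypothetical).
    calls.append(1)
    return [1, 2, 3]


fake_now = [0.0]  # controllable clock so the example is deterministic
cache = TTLCache(30, clock=lambda: fake_now[0])

cache.get_or_compute(("following", "alice"), followed_ids)
cache.get_or_compute(("following", "alice"), followed_ids)
print(len(calls))  # 1: the second call was served from the cache

fake_now[0] = 31.0
cache.get_or_compute(("following", "alice"), followed_ids)
print(len(calls))  # 2: the entry expired, so the subquery ran again
```

A short lifetime trades a little staleness for far fewer executions of the expensive subquery while the database is under pressure.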

After development and testing were completed and the change went online, the database pressure gradually eased.

By 1:30 a.m. on September 1, Beijing time, most services were gradually restored.

However, traffic through Jussi was still very high.

For the sake of stability, I set a low rate limit on the Jussi layer.

This is also why the website was unstable during peak hours over the past week.
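
The report does not describe how the Jussi rate limit is implemented. To illustrate why a low limit degrades service at peak times, here is a minimal token-bucket sketch, a common rate-limiting technique chosen for this example and not necessarily what Jussi uses. The class name and the numbers are hypothetical:

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill according to the elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


fake_now = [0.0]  # controllable clock so the example is deterministic
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(4)]
print(burst)  # [True, True, True, False]

fake_now[0] = 1.0  # one second later, two tokens have been refilled
later = [bucket.allow() for _ in range(3)]
print(later)  # [True, True, False]
```

With a low `rate`, legitimate users at peak hours run out of tokens and see rejected requests, which matches the degraded service described above.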

In the past ten days, a lot of configuration and code has been temporarily added, and more backend machines were brought in.

Most of the time was spent on tuning and testing.

At present, the service is gradually stabilizing.

Later, I will gradually restore Jussi's rate limit to its pre-incident configuration.

Other

Since I need to handle multiple things at the same time, I have been unable to respond promptly to much of the feedback and many of the questions from the community.

Sorry.


从 8 月 30 日开始的网站服务不稳定,目前初步告一段落。

事故的原因

35.172.100.150 和 185.100.235.148 短时间内一直大量发送请求到 https://api.steemit.com,请求的接口是 condenser_api.get_discussions_by_feed 和 condenser_api.get_discussions_by_author_before_date。

这两个接口的 SQL 涉及了子查询,且子查询涉及全表查询,因此性能非常差。之前由于请求频次低,已经有一些缓存措施,所以服务没有出现问题。

这次的情况是,这两个 IP 的快速批量拉取数据,直接穿透了缓存层,把压力全部压到了 Hivemind 的数据库上,把数据库打垮了。

进而所有服务都受到了波及。

采取措施

由于 jussi 没有 IP 过滤功能,最快的方案是在 jussi 前引入一个 openresty 反向代理,把上面的两个 IP 拉黑(Cloudflare 我没有权限,公司运维短时间也联络不到)。

配置、测试、上线花费了半天的时间。

但是上线后,数据库的压力迟迟降不下来。

重新查看日志,发现还有其他的包含子查询的慢查询,因为受到之前两个接口的影响,都大量积压在数据库上。

由于 hivemind 本身不带缓存功能,缓存都是在 jussi 上。对于目前的情况没有优化空间。

于是在 hivemind 代码里临时增加了 Redis 缓存功能,对部分慢查询中的子查询进行了缓存。https://github.com/steemit/hivemind/pull/320/commits/c53dfbd2465e0be4b5438e0fc47aed4cf648005f

开发测试完毕,上线后,数据库压力逐步恢复。

到北京时间9月1日凌晨1点30分,其他服务也逐步恢复。

但是 jussi 的流量依然很大,为了稳定,我调低了 jussi 的接口访问频次。

这也是网站最近一周在访问高峰期出现服务降级的根本原因。

在过去的十天里,临时增加了很多配置和代码,增加了后端的机器数量。

目前服务在逐步平稳下来。

后面会根据实际情况,逐步恢复 jussi 的接口访问频次,直到恢复到故障前的状态。

其他

由于我本人需要同时处理多个事务,因此很多来自社区的反馈和提问我无法做到第一时间反馈,抱歉。
