On Linux, the NTP (Network Time Protocol) service runs in the background and periodically synchronizes the system time.
Even when NTP (the ntpd daemon) is running, if it cannot reach an external time server, the server's clock gradually falls behind.

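When a container is launched on the Hadoop cluster, YARN checks that the request falls within its allowed time window, so once the clocks on the servers drift far enough apart, jobs start to fail.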


Related error
2016-07-22 17:04:44,437 ERROR [Thread-712]: SessionState (SessionState.java:printError(833)) - Status: Failed
2016-07-22 17:04:44,438 ERROR [Thread-712]: SessionState (SessionState.java:printError(833)) - Vertex failed, vertexName=Map 6, vertexId=vertex_1468916542684_0021_1_00, diagnostics=[Task failed, taskId=task_1468916542684_0021_1_00_000007, diagnostics=[TaskAttempt 0 failed, info=[Container launch failed for container_e54_1468916542684_0021_02_000002 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1469175137466 found 1469175128093
Note: System times on machines may be out of sync. Check system time and time zones.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:168)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 1 failed, info=[Container launch failed for container_e54_1468916542684_0021_02_000015 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1469175138445 found 1469175128918
Note: System times on machines may be out of sync. Check system time and time zones.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:168)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 2 failed, info=[Container launch failed for container_e54_1468916542684_0021_02_000017 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1469175139258 found 1469175129970
Note: System times on machines may be out of sync. Check system time and time zones.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:168)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 3 failed, info=[Container launch failed for container_e54_1468916542684_0021_02_000020 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1469175140315 found 1469175131018
Note: System times on machines may be out of sync. Check system time and time zones.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:168)
at org.apache.tez.dag.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1468916542684_0021_1_00 [Map 6] killed/failed due to:null]
2016-07-22 17:04:44,438 ERROR [Thread-712]: SessionState (SessionState.java:printError(833)) - Vertex killed, vertexName=Reducer 3, vertexId=vertex_1468916542684_0021_1_04, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1468916542684_0021_1_04 [Reducer 3] killed/failed due to:null]
2016-07-22 17:04:44,439 ERROR [Thread-712]: SessionState (SessionState.java:printError(833)) - DAG failed due to vertex failure. failedVertices:1 killedVertices:5
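
The two numbers in "current time is ... found ..." are epoch milliseconds, so subtracting them shows how late each launch request was. Below is a minimal Python sketch of that arithmetic using the four pairs from the log above. It assumes the second value is the token's expiry timestamp, which is how I read the message; the name expired_by_ms is only for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the YARN message "current time is <ms> found <ms>".
TOKEN_RE = re.compile(r"current time is (\d+) found (\d+)")

def expired_by_ms(line):
    """Return (current_ms, expiry_ms, difference_ms), or None if no match."""
    m = TOKEN_RE.search(line)
    if m is None:
        return None
    current_ms, expiry_ms = int(m.group(1)), int(m.group(2))
    return current_ms, expiry_ms, current_ms - expiry_ms

# The four "token is expired" lines copied from the log above.
log_lines = [
    "This token is expired. current time is 1469175137466 found 1469175128093",
    "This token is expired. current time is 1469175138445 found 1469175128918",
    "This token is expired. current time is 1469175139258 found 1469175129970",
    "This token is expired. current time is 1469175140315 found 1469175131018",
]

for line in log_lines:
    current_ms, expiry_ms, diff = expired_by_ms(line)
    seen_at = datetime.fromtimestamp(current_ms / 1000, tz=timezone.utc)
    print(seen_at.isoformat(), "token already expired for", diff, "ms")
```

Each attempt arrives a little over 9 seconds after the token expired, which points to a clock problem rather than a one-off delay.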


Check the time on each server and correct it if it does not match.
 

# Set the system time manually

> Check the system time
[root@hdfs ~]# date
2016. 07. 25. (월) 21:57:52 KST
 
> Set the time of day (24-hour format)
[root@hdfs ~]# date -s 23:43:21
 
> Set the date and the time together
[root@hdfs ~]# date -s '2016-7-26 11:21:21'


If an NTP server is set up inside your network and every server in the HDFS cluster can reach it,
you can use that internal NTP server in place of the time server (time.bora.net) in the settings below.

# Check NTP status

> Check whether the ntpd service is running
[root@hdfs ~]# service ntpd status
ntpd is stopped
 
> Check the state of the NTP service with the ntpstat command
[root@hdfs ~]# ntpstat
synchronised to NTP server (211.233.40.78) at stratum 3
   time correct to within 163 ms
   polling server every 1024 s
 
> Output when the daemon is not running
[root@www ~]# ntpstat
Unable to talk to NTP daemon. Is it running?
 
> Query the current time on the time server
[root@hdfs ~]# rdate -p time.bora.net
rdate: [time.bora.net]  Mon Jul 25 21:10:57 2016
 

# Synchronize the system time

> Synchronize with the time server (one-off)
[root@hdfs ~]# rdate -s time.bora.net
[root@hdfs ~]# date
2016. 07. 25. (월) 21:44:19 KST
 
> Synchronize with the time server periodically (every day at midnight)
[root@hdfs ~]# crontab -e
0 0 * * * rdate -s time.bora.net
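
To see how far a server has drifted without changing its clock, the same query that rdate -p makes can be sketched in Python. This assumes the time server answers the RFC 868 Time Protocol on TCP port 37, which is what rdate uses by default; the function names are made up for this example, and only the conversion step runs without network access.

```python
import socket
import struct
import time

# Seconds between 1900-01-01 (RFC 868 epoch) and 1970-01-01 (Unix epoch).
RFC868_TO_UNIX = 2208988800

def rfc868_to_unix(payload):
    """Convert the 4-byte big-endian RFC 868 reply to Unix seconds."""
    (seconds_since_1900,) = struct.unpack("!I", payload[:4])
    return seconds_since_1900 - RFC868_TO_UNIX

def query_time_server(host, port=37, timeout=5.0):
    """Read the server's time over TCP; needs network access to the host."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        payload = b""
        while len(payload) < 4:
            chunk = sock.recv(4 - len(payload))
            if not chunk:
                raise ConnectionError("time server closed the connection early")
            payload += chunk
    return rfc868_to_unix(payload)

def clock_offset_seconds(host):
    """Positive result: the local clock is behind the time server."""
    return query_time_server(host) - int(time.time())
```

Calling clock_offset_seconds("time.bora.net") on each node gives a rough, whole-second offset for that node, which is enough to spot the kind of skew that breaks container launches.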



Vertica ROS Container errors and settings


java.sql.SQLTransientException: [Vertica][VJDBC](5065) ERROR: Too many ROS containers exist for the following projections:
public.DATASET_TABLE_super (limit = 11264, ROS files = 11264, DV files = 0, new files = 11)
at com.vertica.util.ServerErrorData.buildException(Unknown Source)
at com.vertica.dataengine.VQueryExecutor.readCopyStartResponse(Unknown Source)
at com.vertica.dataengine.VQueryExecutor.handleExecuteResponse(Unknown Source)
at com.vertica.dataengine.VQueryExecutor.execute(Unknown Source)
at com.vertica.jdbc.VerticaCopyStream.execute(Unknown Source)

Caused by: com.vertica.support.exceptions.TransientException: [Vertica][VJDBC](5065) ERROR: Too many ROS containers exist for the following projections:
public.DATASET_TABLE_super (limit = 11264, ROS files = 11264, DV files = 0, new files = 11)


When you load data into Vertica with a direct COPY, the data is written straight to the ROS (Read Optimized Store, on disk).
Each direct load creates a ROS container and loads the data into it.

If you set a batch size for the load, each ROS container holds only that many rows.
If the batch size is too small, the number of ROS containers can grow out of control.

For example,
copying 1,000,000 rows with a batch size of 100 creates 1,000,000 / 100 = 10,000 ROS containers.
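
The same arithmetic as a small Python sketch, assuming (as in the example above) that every batch ends up in its own ROS container and that no mergeout runs in between. The function name is only for illustration.

```python
import math

def ros_containers_created(total_rows, batch_size):
    """Estimate the containers created when each batch gets its own container."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(total_rows / batch_size)

# Compare a few batch sizes for the 1,000,000-row load from the example.
for batch_size in (100, 1_000, 10_000):
    print(batch_size, ros_containers_created(1_000_000, batch_size))
```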

# Query the ROS containers of a specific table
dbadmin=> select * from STORAGE_CONTAINERS where projection_name like '%DATASET_TABLE%';
-[ RECORD 1 ]-------+-------------------------------------------------
node_name           | v_vmartdb_node0001
schema_name         | public
projection_id       | 45035996273822238
projection_name     | DATASET_TABLE_super
storage_type        | ROS
storage_oid         | 45035996275015557
sal_storage_id      | 02dff9215fa3b060135119b5ff6210ba00a000000013ff85
total_row_count     | 100    << number of rows stored
deleted_row_count   | 0
used_bytes          | 2394
start_epoch         | 39779
end_epoch           | 39779
grouping            | PROJECTION
segment_lower_bound | 0
segment_upper_bound | 4294967295
is_sorted           | t
location_label      |
delete_vector_count | 0


If the batch size is too small for a large load, the number of ROS containers exceeds the value of the ContainersPerProjectionLimit parameter and the error above is raised.
When you specify a batch size for a large COPY load, it seems best to start at 10,000 rows or more and tune from there.

The ContainersPerProjectionLimit parameter, which caps the number of ROS containers, defaults to 1024.
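
As a rough guide for picking a batch size, the limit can be turned around into a minimum batch size. The sketch below rests on several simplifying assumptions of mine: one container per batch, no mergeout during the load, and the error firing once the count reaches the limit. The function and parameter names are hypothetical.

```python
def min_batch_size(total_rows, container_limit=1024, existing_containers=0):
    """Smallest batch size that keeps the container count below the limit."""
    # Containers we may still create while staying under container_limit.
    headroom = container_limit - 1 - existing_containers
    if headroom <= 0:
        raise ValueError("no room left for new ROS containers")
    # Ceiling division using integers only.
    return -(-total_rows // headroom)

print(min_batch_size(1_000_000))        # default limit of 1024
print(min_batch_size(1_000_000, 2048))  # after raising the limit to 2048
```

In practice the 10,000-row starting point suggested above leaves far more headroom than this lower bound.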

# Query the ContainersPerProjectionLimit setting
SELECT *
FROM CONFIGURATION_PARAMETERS
WHERE parameter_name = 'ContainersPerProjectionLimit' ;
-[ RECORD 1 ]-----------------+---------------------------------------------------------------------------------------
node_name                     | ALL
parameter_name                | ContainersPerProjectionLimit
current_value                 | 1024
restart_value                 | 1024
database_value                | 1024
default_value                 | 1024
current_level                 | DEFAULT
restart_level                 | DEFAULT
is_mismatch                   | f
groups                        |
allowed_levels                | NODE, DATABASE
superuser_only                | f
change_under_support_guidance | t
change_requires_restart       | f
description                   | Number of ROS containers that are allowed before new ROSs are prevented (ROS pushback)


# Change the ContainersPerProjectionLimit setting
SELECT SET_CONFIG_PARAMETER('ContainersPerProjectionLimit', 2048);


# Watch the ROS container count grow at 1-second intervals
[dbadmin@dw ~]$ while true; do vsql -d Database -w password -At -c "select count(*) from STORAGE_CONTAINERS where projection_name like '%DATASET_TABLE%';"; sleep 1; done;
16
.
.
.
964
978
988
1002
1016
1024   -- the error occurs once the count reaches 1024 (ERROR: Too many ROS containers exist)

One exception can occur here:
if the Tuple Mover runs a mergeout before the ROS container limit is reached, the container count drops, so the load may avoid the error by chance.
Mergeout consolidates the data from many ROS containers into a small number of ROS containers.
By default, the Tuple Mover checks for mergeouts every 600 seconds (10 minutes).


# Query the MoveOut and MergeOut interval parameters
SELECT *
FROM CONFIGURATION_PARAMETERS
WHERE parameter_name IN('MoveOutInterval', 'MergeOutInterval');

-[ RECORD 1 ]-----------------+-----------------------------------------------------------------------
node_name                     | ALL
parameter_name                | MoveOutInterval
current_value                 | 300
restart_value                 | 300
database_value                | 300
default_value                 | 300
current_level                 | DEFAULT
restart_level                 | DEFAULT
is_mismatch                   | f
groups                        |
allowed_levels                | NODE, DATABASE
superuser_only                | f
change_under_support_guidance | f
change_requires_restart       | f
description                   | Interval between Tuple Mover checks for moveouts to perform (seconds)
-[ RECORD 2 ]-----------------+-----------------------------------------------------------------------
node_name                     | ALL
parameter_name                | MergeOutInterval
current_value                 | 600
restart_value                 | 600
database_value                | 600
default_value                 | 600
current_level                 | DEFAULT
restart_level                 | DEFAULT
is_mismatch                   | f
groups                        |
allowed_levels                | NODE, DATABASE
superuser_only                | f
change_under_support_guidance | f
change_requires_restart       | f
description                   | Interval between Tuple Mover checks for mergeouts to perform (seconds)


# Get the current MoveOut and MergeOut interval values
dbadmin=> SELECT GET_CONFIG_PARAMETER('MoveOutInterval');
 GET_CONFIG_PARAMETER
----------------------
 300
(1 row)

dbadmin=> SELECT GET_CONFIG_PARAMETER('MergeOutInterval');
 GET_CONFIG_PARAMETER
----------------------
 600
(1 row)


# Change the MoveOut and MergeOut intervals
dbadmin=> SELECT SET_CONFIG_PARAMETER('MoveOutInterval', 60);
    SET_CONFIG_PARAMETER    
----------------------------
 Parameter set successfully
(1 row)

dbadmin=> SELECT SET_CONFIG_PARAMETER('MergeOutInterval', 30);
    SET_CONFIG_PARAMETER    
----------------------------
 Parameter set successfully
(1 row)

> The same result using ALTER DATABASE
ALTER DATABASE mydb SET MoveOutInterval = 60;
ALTER DATABASE mydb SET MergeOutInterval = 30;



> Error message


  File "/usr/lib/python2.6/site-packages/resource_management/core/logger.py", line 101, in filter_text
    text = text.replace(unprotected_string, protected_string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 117: ordinal not in range(128)
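
Byte 0xea is a common first byte of UTF-8 encoded Hangul, so the text being logged most likely contained Korean characters that the ASCII codec cannot decode. The sketch below reproduces the same failure with an explicit decode. It is written for Python 3, whereas the traceback comes from Python 2.6, where the decode happens implicitly; the sample bytes are my own example, not taken from the log.

```python
# UTF-8 bytes for the Hangul syllable "가" (U+AC00); the first byte is 0xea.
data = b"\xea\xb0\x80"

def try_decode(raw, encoding):
    """Decode raw bytes, returning a short failure note instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason

print(try_decode(data, "ascii"))  # the ASCII codec rejects byte 0xea
print(try_decode(data, "utf-8"))  # UTF-8 decodes the same bytes
```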


> Change the Python default encoding

[root@peaceful ~]# vi /usr/lib/python2.6/site-packages/site.py
def __boot():
    import sys, imp, os, os.path
    sys.setdefaultencoding("utf-8")  # added line: make UTF-8 the default encoding (Python 2 only)
    PYTHONPATH = os.environ.get('PYTHONPATH')
    if PYTHONPATH is None or (sys.platform=='win32' and not PYTHONPATH):
        PYTHONPATH = []
    else:
        PYTHONPATH = PYTHONPATH.split(os.pathsep)

