During my work today, I was surprised by several errors I found while using the spark-sas7bdat package to ingest SAS datasets into Spark. These errors could have a huge impact on an enterprise data warehouse system if you are using the current version of spark-sas7bdat (1.1.4-s_2.10, as listed on Maven) to ingest data from SAS into a Spark environment.
I will briefly explain the three major problems I found. Before I start, I want to state that I like the spark-sas7bdat package; it is no doubt a great free tool that saves users a great deal of money they would otherwise spend on official SAS licenses.
0. Experimental Setup
I used the following code, based on SAS proc hadoop, to move the SAS dataset to HDFS:
%let hdpcfg = '/opt/sas94/bin/hadoop/conf/combined-site.xml';
proc hadoop cfg=&hdpcfg verbose;
   hdfs mkdir='/hdfsuser/demo/';
   hdfs copyfromlocal='/newdata/birthday.sas7bdat'
        out='/hdfsuser/demo/birthday.sas7bdat' overwrite;
run;
And I used the following code in PySpark to save the SAS dataset to Spark-Hive:
PYSPARK_DRIVER_PYTHON=$PYTHON_HOME/ipython pyspark --master yarn-client \
    --packages saurfang:spark-sas7bdat:1.1.4-s_2.10
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.format('com.github.saurfang.sas.spark') \
    .load('hdfs:///user/username/data.sas7bdat')
df.printSchema()
df.write.format('orc').saveAsTable('db1.table_data')
SAS version: 22.214.171.124
Spark version: 1.6.1
Python version: 2.7.12
Hadoop 126.96.36.199.4.2.0-258 hortonworks -r 13debf893a605e8a88df18a7d8d214f571e05289
1. SAS Date variables ingested in Hive are rolled back one day earlier
First create a simple data set in SAS:
data newdata.birthday;
   input @01 employee_id 6.
         @08 last_name $10.
         @19 birthday date9.;
   format employee_id 6. last_name $10. birthday date9.;
   datalines;
1247   Garcia     04APR1954
1078   Gibson     23APR1936
1005   Knapp      06OCT1938
1024   Mueller    17JUN1953
;
Then, using the script listed in Section 0, I ingested the data into Hive. The results are:
Notice that all the dates are rolled back by one day.
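I have not traced the package's source, so the following is only a plausible mechanism for this off-by-one-day behavior, sketched in plain Python. SAS stores a date as a count of days since 1960-01-01; a timezone-free conversion recovers the right date, while a conversion that detours through the Unix epoch and renders the instant in a zone west of UTC (a fixed UTC-6 below, standing in for US Central time) rolls the date back one day:

```python
from datetime import date, datetime, timedelta, timezone

# SAS stores a DATE value as the number of days since 1960-01-01,
# with no timezone attached.
sas_days = (date(1954, 4, 4) - date(1960, 1, 1)).days  # 04APR1954

# Correct, timezone-free conversion back to a calendar date:
correct = date(1960, 1, 1) + timedelta(days=sas_days)
print(correct)  # 1954-04-04

# Buggy pattern: shift to the Unix epoch and render the resulting
# instant in a zone west of UTC. Midnight UTC becomes 18:00 of the
# *previous* day, so the date rolls back by one.
unix_days = sas_days - (date(1970, 1, 1) - date(1960, 1, 1)).days
utc_midnight = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(days=unix_days)
local = utc_midnight.astimezone(timezone(timedelta(hours=-6)))
print(local.date())  # 1954-04-03
```

Any fixed negative UTC offset produces the same one-day rollback, because a SAS date has no time-of-day component to absorb the shift.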
2. SAS Datetime variables ingested in Hive are rolled back several hours
data newdata.birthtime;
   input @01 employee_id 6.
         @08 last_name $10.
         @19 birthtime datetime21.2;
   format employee_id 6. last_name $10. birthtime datetime21.2;
   datalines;
1247   Garcia     04APR1954:08:00:00.00
1078   Gibson     23APR1936:16:00:00.00
1005   Knapp      06OCT1938:23:59:59.00
1024   Mueller    17JUN1953:00:00:00.01
;
In Hive, for observations 0 and 2 the datetime is rolled back 6 hours, while for observations 1 and 3 it is rolled back 5 hours. One guess is that a timezone reference error, probably combined with a daylight-saving-time adjustment, is causing this issue.
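Again without having inspected the package's source, here is a minimal Python sketch of the suspected mechanism. SAS stores a datetime as seconds since 1960-01-01 00:00:00 with no timezone; if those seconds are shifted to the Unix epoch and then rendered in the cluster's local zone (a fixed UTC-6 below, standing in for US Central standard time), the wall-clock value loses 6 hours. When the local zone is on daylight-saving time, the loss would be 5 hours instead, which would explain the mixed 5/6-hour rollbacks observed above:

```python
from datetime import datetime, timedelta, timezone

SAS_EPOCH = datetime(1960, 1, 1)

# SAS stores a DATETIME value as seconds since 1960-01-01 00:00:00,
# with no timezone attached.
sas_seconds = (datetime(1954, 4, 4, 8, 0, 0) - SAS_EPOCH).total_seconds()

# Correct, timezone-free conversion:
correct = SAS_EPOCH + timedelta(seconds=sas_seconds)
print(correct)  # 1954-04-04 08:00:00

# Buggy pattern: shift to the Unix epoch and render the instant in a
# fixed UTC-6 zone. The wall-clock value loses 6 hours.
unix_seconds = sas_seconds - (datetime(1970, 1, 1) - SAS_EPOCH).total_seconds()
shifted = (datetime(1970, 1, 1, tzinfo=timezone.utc)
           + timedelta(seconds=unix_seconds)).astimezone(timezone(timedelta(hours=-6)))
print(shifted.replace(tzinfo=None))  # 1954-04-04 02:00:00
```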
3. Missing value misrepresentation caused by SAS data set compression
I found that when the original SAS dataset is large and contains missing values, the COMPRESS=YES dataset option may lead to misrepresented values in Hive.
Because the problem was found on a complicated dataset at work that I cannot share publicly, I will only list the SAS script and screenshots here. The SAS dataset has many missing values and is composed of date, character, numeric, and datetime features; there are 636 features in total.
Basically, I extracted the top 1,000 records from the original huge dataset using two different compression options:
data mydata.demo_compressed_1k (compress=Yes);
   set mydata.originaldata (obs=1000);
run;
data mydata.demo_uncompressed_1k (compress=NO);
   set mydata.originaldata (obs=1000);
run;
The results are shown in the next two figures: the top one is the correct (uncompressed) data, and the bottom one is the mistaken (compressed) data.
As shown, due to compression, the NULL values are replaced by 0 and sometimes by random, extremely large numbers. The risk here is that COMPRESS=YES is the default setting in many SAS environments, so unless users explicitly turn it off, datasets will be compressed automatically and can be totally wrong in Hive after ingestion.
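One practical defense is to ingest the same extract twice (once compressed, once uncompressed) and compare per-column null counts; in PySpark the counts could come from `df.filter(df[c].isNull()).count()` per column. The sketch below shows the comparison logic framework-free; the toy rows, where the "compressed" copy has turned a NULL into 0, are purely illustrative:

```python
# Given the same rows ingested two ways, report columns whose null
# counts disagree between the two copies.
def null_counts(rows, columns):
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

def null_mismatches(rows_a, rows_b, columns):
    a, b = null_counts(rows_a, columns), null_counts(rows_b, columns)
    return {c: (a[c], b[c]) for c in columns if a[c] != b[c]}

# Toy illustration: the "compressed" copy turned a NULL into 0.
uncompressed = [{'id': 1, 'amount': None}, {'id': 2, 'amount': 5.0}]
compressed   = [{'id': 1, 'amount': 0.0},  {'id': 2, 'amount': 5.0}]
print(null_mismatches(uncompressed, compressed, ['id', 'amount']))
# {'amount': (1, 0)}
```

A non-empty result means the two ingestion paths disagree about which values are missing, which is exactly the symptom described above.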
I would love to see a response from the author of the spark-sas7bdat package and hope these issues can be solved. It is no doubt a great open-source package, and I hope it becomes even more reliable. However, the impact of the existing issues can be huge for all ingested data, and it may affect many business applications. Be aware of this and run the necessary sanity checks if you think your applications and data warehouse may have been affected!