Be aware of hidden data errors when using the spark-sas7bdat package to ingest SAS datasets into Spark

During my work today, I was surprised by several errors I found while using the spark-sas7bdat package to ingest SAS datasets into Spark. These errors could have a huge impact on an enterprise data warehouse if you are using the current version, spark-sas7bdat 1.1.4-s_2.10 as listed on Maven, to move data from SAS into a Spark environment.

I will briefly explain the three major problems I have found. Before I start, I want to state that I like the spark-sas7bdat package; it is no doubt a great free tool that saves users tons of money on official SAS licenses.

0.  Experimental Setup

I used the following code, based on SAS PROC HADOOP, to move the SAS dataset to HDFS:


%let hdpcfg = '/opt/sas94/bin/hadoop/conf/combined-site.xml';
proc hadoop cfg=&hdpcfg verbose;
   hdfs mkdir='/hdfsuser/demo/';
   hdfs copyfromlocal='/newdata/birthday.sas7bdat'
        out='/hdfsuser/demo/birthday.sas7bdat' overwrite;
run;

And I used the following code in PySpark to save the SAS dataset into Hive:


# Launch PySpark with the spark-sas7bdat package pulled from Maven:
PYSPARK_DRIVER_PYTHON=$PYTHON_HOME/ipython pyspark --master yarn-client --packages saurfang:spark-sas7bdat:1.1.4-s_2.10

# Read the SAS dataset through the package and save it to Hive as an ORC table:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.format('com.github.saurfang.sas.spark').load('hdfs:///user/username/data.sas7bdat')
df.printSchema()
df.write.format('orc').saveAsTable('db1.table_data')
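
After the table is written, a quick sanity check is cheap insurance. The two lines below are my own addition, not part of the package workflow: compare Spark's row count against what PROC CONTENTS reports in SAS, and eyeball a few rows.

print(df.count())   # should match the observation count PROC CONTENTS reports in SAS
df.show(5)          # spot-check a few rows for shifted dates or odd missing values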

SAS version: 9.4.1.0

Spark version: 1.6.1

Python version: 2.7.12

Hadoop version: 2.7.1.2.4.2.0-258 (Hortonworks, -r 13debf893a605e8a88df18a7d8d214f571e05289)

1.  SAS Date variables ingested into Hive are rolled back by one day

First, create a simple data set in SAS:

data newdata.birthday;
   input @01 employee_id 6.
         @08 last_name $10.
         @19 birthday date9.;
   format employee_id 6.
          last_name $10.
          birthday date9.;
   datalines;
1247   Garcia     04APR1954
1078   Gibson     23APR1936
1005   Knapp      06OCT1938
1024   Mueller    17JUN1953
;
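
Before ingesting, it is also worth checking what the .sas7bdat file itself contains, independent of both SAS and Spark. Here is a minimal sketch assuming the pure-Python sas7bdat package (pip install sas7bdat) and a local copy of the file; the path is the one used in Section 0.

# Independent reference read of the raw file; the first row yielded is the header.
from sas7bdat import SAS7BDAT

with SAS7BDAT('/newdata/birthday.sas7bdat') as reader:
    for row in reader:
        print(row)

If this reader returns 04APR1954 and friends correctly while Hive shows them shifted, the problem is clearly in the ingestion path rather than in the file.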

Then I used the script listed in Section 0 to ingest the data into Hive, and the results are:

obs  employee_id  last_name  birthday
0    1247.0       Garcia     1954-04-03
1    1078.0       Gibson     1936-04-22
2    1005.0       Knapp      1938-10-05
3    1024.0       Mueller    1953-06-16

Notice that all the dates are rolled back by one day.
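
A SAS DATE value is simply the number of days since 1960-01-01, with no time or timezone component, so the conversion should be pure date arithmetic. The sketch below is my own illustration of the likely failure mode (a guess, not confirmed from the package source): routing the day count through an epoch timestamp and rendering it in a timezone behind UTC lands on the previous calendar day.

from datetime import date, datetime, timedelta

SAS_EPOCH = date(1960, 1, 1)

# 04APR1954 as stored in the file: days since the SAS epoch (negative here).
sas_days = (date(1954, 4, 4) - SAS_EPOCH).days

# Timezone-safe conversion: pure date arithmetic, no clock time involved.
print(SAS_EPOCH + timedelta(days=sas_days))        # 1954-04-04

# Hypothesized bug: build a timestamp at UTC midnight, then render it in a
# local zone behind UTC (here UTC-6), which falls on the previous day.
utc_midnight = datetime(1960, 1, 1) + timedelta(days=sas_days)
print((utc_midnight - timedelta(hours=6)).date())  # 1954-04-03, one day early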

2.  SAS Datetime variables ingested into Hive are rolled back by several hours


data newdata.birthtime;
   input @01 employee_id 6.
         @08 last_name $10.
         @19 birthtime datetime21.2;
   format employee_id 6.
          last_name $10.
          birthtime datetime21.2;
   datalines;
1247   Garcia     04APR1954:08:00:00.00
1078   Gibson     23APR1936:16:00:00.00
1005   Knapp      06OCT1938:23:59:59.00
1024   Mueller    17JUN1953:00:00:00.01
;

In Hive, for obs 0 and 2 the datetime is rolled back 6 hours, while for obs 1 and 3 it is rolled back 5 hours. My guess is that a timezone reference error, probably combined with a daylight saving time adjustment, is causing this issue.

obs  employee_id  last_name  birthtime
0    1247.0       Garcia     1954-04-04 02:00:00.0
1    1078.0       Gibson     1936-04-23 11:00:00.0
2    1005.0       Knapp      1938-10-06 17:59:59.0
3    1024.0       Mueller    1953-06-16 19:00:00.01
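
A SAS DATETIME value is the number of seconds since 1960-01-01 00:00:00, again with no timezone attached. The 6-hour shifts on the winter-side rows and 5-hour shifts on the summer-side rows are exactly what a US Central rendering (UTC-6, or UTC-5 during daylight saving) of an epoch-based timestamp would produce. The sketch below illustrates that hypothesis for obs 0; it is not taken from the package source.

from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)

# 04APR1954:08:00:00 as stored in the file: seconds since the SAS epoch.
sas_seconds = (datetime(1954, 4, 4, 8) - SAS_EPOCH).total_seconds()

# Timezone-safe conversion: the value is a wall-clock offset from the epoch.
print(SAS_EPOCH + timedelta(seconds=sas_seconds))  # 1954-04-04 08:00:00

# Hypothesized bug: the same value rendered 6 hours behind UTC (US Central
# standard time) reproduces the observed row for obs 0.
print(SAS_EPOCH + timedelta(seconds=sas_seconds) - timedelta(hours=6))  # 1954-04-04 02:00:00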

3.  Missing value misrepresentation caused by SAS data set compression

I found that when the original SAS dataset is large, contains missing values, and is stored with “COMPRESS=YES” (a common site-wide default in enterprise SAS deployments), the ingestion may misrepresent those missing values in Hive.

Because the problem was found on a complicated data set at work that I cannot share publicly, I will only show the SAS script and screenshots here. The data set has many missing values and is composed of date, character, numeric, and datetime features, 636 features in total.

Basically, I extracted the top 1,000 records from the original huge data set using two different compression options:


data mydata.demo_compressed_1k(compress=YES);
   set mydata.originaldata(obs=1000);
run;

data mydata.demo_uncompressed_1k(compress=NO);
   set mydata.originaldata(obs=1000);
run;

The results are shown in the next two figures. The top figure shows the correct (uncompressed) data, and the bottom figure shows the mistaken (compressed) data.

[Figure: uncompressed data, ingested correctly]

[Figure: compressed data, ingested with mistaken values]

As shown, due to compression the NULL values are replaced by 0, and sometimes by random, extremely large numbers. The risk here is that many enterprise SAS deployments set COMPRESS=YES as the site-wide default, so unless users explicitly turn it off, datasets are compressed automatically and their missing values can be totally mistaken in Hive after ingestion.
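
A practical way to catch this without SAS in the loop is to ingest both copies and diff their per-column null counts in PySpark. Below is a minimal sketch of my own, assuming both files were uploaded to HDFS as in Section 0 (the HDFS paths are hypothetical):

# Load the compressed and uncompressed copies of the same 1,000 records.
fmt = 'com.github.saurfang.sas.spark'
df_c = sqlContext.read.format(fmt).load('hdfs:///hdfsuser/demo/demo_compressed_1k.sas7bdat')
df_u = sqlContext.read.format(fmt).load('hdfs:///hdfsuser/demo/demo_uncompressed_1k.sas7bdat')

# If compression were handled correctly, per-column null counts would match.
for col in df_u.columns:
    n_u = df_u.filter(df_u[col].isNull()).count()
    n_c = df_c.filter(df_c[col].isNull()).count()
    if n_u != n_c:
        print('%s: %d nulls uncompressed vs %d compressed' % (col, n_u, n_c))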

Final words

I would love to see a response from the author of the spark-sas7bdat package and hope these issues can be solved. It is no doubt a great open source package, and I hope it becomes more reliable. However, the impact of the existing issues can be huge for all ingested data, and it may affect many business applications. Be aware of this and run the necessary sanity checks if you think your applications and data warehouse may have been affected!
