MEMO: Ingesting SAS datasets to Spark/Hive

In SAS (assuming Hadoop integration is configured), copy the .sas7bdat file to HDFS using PROC HADOOP:

/* Cluster configuration for PROC HADOOP */
%let hdpcfg = '/opt/sas94/bin/hadoop/conf/combined-site.xml';

proc hadoop cfg=&hdpcfg verbose;
  /* Create the target HDFS directory (if it does not already exist) */
  hdfs mkdir='/user/username/';
  /* Upload the local dataset file to HDFS */
  hdfs copyfromlocal='/sasfolder/data.sas7bdat'
       out='/user/username/data.sas7bdat' overwrite;
run;
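
Before moving on, it helps to confirm the file actually landed in HDFS. A minimal sketch, run from inside the PySpark shell started in the next step: it reaches the Hadoop FileSystem API through PySpark's py4j gateway, so sc._jvm and sc._jsc are internal attributes and may differ across Spark versions.

# Sanity check: does the uploaded file exist in HDFS?
# (sc._jvm / sc._jsc are PySpark internals, used here for illustration)
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
print(fs.exists(Path('/user/username/data.sas7bdat')))  # expect: True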

Start a PySpark shell on YARN with the spark-sas7bdat package:

PYSPARK_DRIVER_PYTHON=$PYTHON_HOME/ipython pyspark \
  --master yarn-client \
  --num-executors 10 \
  --driver-memory 10G \
  --executor-memory 10G \
  --executor-cores 10 \
  --queue team \
  --conf spark.driver.maxResultSize=10G \
  --packages saurfang:spark-sas7bdat:1.1.4-s_2.10
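
The -s_2.10 suffix on the package coordinate is the Scala build of the artifact and must match the Scala version your Spark installation was compiled against. One way to check both from inside the shell (the Scala lookup goes through the py4j gateway, so sc._jvm is again an internal attribute):

# Spark and Scala versions; the package suffix (_2.10) must match the latter
print(sc.version)
print(sc._jvm.scala.util.Properties.versionString())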

In the PySpark shell, read the file and save it as a Hive table:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is created automatically by the pyspark shell

# Read the sas7bdat file through the spark-sas7bdat data source
df = sqlContext.read.format('com.github.saurfang.sas.spark').load('hdfs:///user/username/data.sas7bdat')
df.printSchema()

# Write the data out as an ORC-backed Hive table
df.write.format('orc').saveAsTable('db1.table_data')
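
As a final sanity check (table and database names as above), the new Hive table can be queried straight away; HiveContext.sql returns a DataFrame, so .show() prints the result:

# Confirm the ingested table is queryable from Hive
sqlContext.sql('SELECT COUNT(*) FROM db1.table_data').show()
sqlContext.sql('SELECT * FROM db1.table_data LIMIT 5').show()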

 
