Getting started
Install CueLake
CueLake uses Kubernetes kubectl for installation. Create a namespace and then install using the cuelake.yaml file. Creating a namespace is optional; you can also install in the default namespace or in any existing namespace. In the commands below, we use cuelake as the namespace.
kubectl create namespace cuelake
kubectl apply -f https://raw.githubusercontent.com/cuebook/cuelake/main/cuelake.yaml -n cuelake
kubectl port-forward services/lakehouse 8080:80 -n cuelake
To test whether the installation was successful, run the following command:
kubectl get pods -n cuelake
You should see three pods in Running status, similar to the output below:
NAME                                    READY   STATUS    RESTARTS   AGE
lakehouse-74cd5d759b-8pj5c              1/1     Running   0          1m
redis-69cc674bf8-rsd74                  1/1     Running   0          1m
zeppelin-server-main-865974dc55-rt9vg   3/3     Running   0          1m
Now visit http://localhost:8080 in your browser.
If you don’t want to use Kubernetes and instead want to try it out on your local machine first, we’ll soon have a docker-compose version. Let us know if you’d want that sooner.
Add Connection
- Go to the Connections screen.
- Click on New Connection.
- Select your source database type.
- Enter your access credentials and click Add Connection.
Configure AWS
Provide access to S3
In your AWS console, go to Elastic Kubernetes Service. Choose your cluster and go to the Configuration tab. Under Configuration, go to the Compute tab. Select your Node Group and click on the Node IAM Role ARN to go to the IAM Roles screen. Attach the AmazonS3FullAccess policy.
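If you prefer the command line, the same policy can be attached with the AWS CLI. This is a minimal sketch; <YourNodeInstanceRole> is a placeholder for the role name shown in the Node IAM Role ARN.

```bash
# Attach the AmazonS3FullAccess managed policy to the node group's IAM role.
# <YourNodeInstanceRole> is a placeholder; use the role from your Node IAM Role ARN.
aws iam attach-role-policy \
  --role-name <YourNodeInstanceRole> \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```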
Set up S3 buckets
In your AWS console, create a new S3 bucket. CueLake will use this bucket to store Iceberg and Spark tables.
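If you would rather script this step, a sketch with the AWS CLI looks like the following; <YourBucketName> and <YourRegion> are placeholders.

```bash
# Create the bucket that CueLake will use for Iceberg and Spark tables.
# <YourBucketName> and <YourRegion> are placeholders for your own values.
aws s3 mb s3://<YourBucketName> --region <YourRegion>
```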
Next, set up this S3 bucket as the warehouse location in Spark. Go to the Settings screen in CueLake and click on the Interpreter Settings tab. Search for the spark interpreter and click edit. Scroll to the bottom and enter the S3 bucket path in the two properties below. The spark.sql.warehouse.dir property is used for Spark tables; the spark.sql.catalog.cuelake.warehouse property is used for Iceberg tables.
spark.sql.warehouse.dir s3a://<YourBucketName>/warehouse
spark.sql.catalog.cuelake.warehouse s3a://<YourBucketName>/cuelake
Click Save and restart the interpreter.
Add New Notebook
- Go to the Notebooks screen and click on New Notebook.
- Select the Incremental Refresh template.
- Select the Source Connection.
- Enter the SQL Select statement (a sketch follows this list).
- In the Timestamp Column field, enter the name of the incremental column. This column must be of data type Timestamp.
- Enter the Primary Key Column.
- Enter a name for the destination table.
- Give a name to the notebook and click Create Notebook.
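For example, a minimal source query might look like the sketch below. The orders table and its columns are hypothetical; id stands in for your Primary Key Column and updated_at for your Timestamp column.

```sql
-- Hypothetical source table: 'id' is the primary key,
-- 'updated_at' (data type Timestamp) is the incremental column.
SELECT id, customer_id, amount, updated_at
FROM orders
```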
Now run the notebook. This will create a new Iceberg table in your S3 bucket. Note that the first run will load the historical data as defined in the SQL query.
Once the run is successful, you can query data from the newly created table. Go to the Notebooks screen and click on New Notebook. Select the Blank template, give a name to the notebook, and click Create Notebook. In the Notebooks screen, click the Notebook icon for your notebook to open the Zeppelin notebook. Your Iceberg table can be queried as cuelake.<your-destination-table-name>.
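As a rough sketch, a query paragraph in the Zeppelin notebook could look like the following. The orders destination table name is a placeholder, and the %spark.sql interpreter prefix is an assumption that may differ depending on how your interpreters are named.

```sql
%spark.sql
-- 'orders' is a placeholder for your destination table name.
SELECT * FROM cuelake.orders LIMIT 10
```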
Merge Incremental data
To upsert incremental data into your S3 table, run the above notebook again after a few rows have been inserted or updated in the source. To confirm that the merge was successful, query the S3 table.
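One rough way to confirm the merge, assuming the hypothetical orders destination table from above, is to compare the row count and latest timestamp before and after the run.

```sql
%spark.sql
-- After a successful merge, the row count and latest 'updated_at'
-- should reflect the newly inserted and updated source rows.
SELECT COUNT(*) AS row_count, MAX(updated_at) AS latest_update
FROM cuelake.orders
```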