For some, getting into data analytics outside of an academic or work environment can be very challenging - where do you start? Which database do you use? And how do you do it for low or zero cost?
In this article, I am going to walk through setting up your VM [1] and database, connecting to your new remote server using Azure Data Studio, and as a bonus, connecting it to dbt. I've also written about setting up dbt on Windows in a previous post.
First, let's talk about requirements & recommendations:
- This tutorial is focused on Windows 10 + Linux. You will need Windows 10 Pro on the machine where you install your VM.
- I recommend that you set up your database on a different physical machine than your dev machine. You should probably have at least 32GB of RAM.
- Since we are installing the database on another machine, that machine needs to be on the same network as your development machine.
Why use a VM at all? In my experience, running a database on your dev machine makes everything extremely slow. Your database will be very greedy with resources (RAM specifically) - so keeping it in a little box that you can turn on and off allows you to keep using your machine "as normal".
Step 1: Enable Hyper-V
Open PowerShell as administrator and run the following command:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All
More info can be found here: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v
Step 2: Create a VM in Hyper-V
You will need to restart your machine in order to use the Hyper-V features, so make sure to do that first. The Microsoft documents to create a VM are excellent - and linked below. Make sure to select Ubuntu 20.04 when you create it.
Step 3: Install SQL Server on your VM
We will install SQL Server [2] from the CLI on Ubuntu, which MS has again laid out very nicely in their documentation. A couple of notes when walking through this:
- Make sure to select "SQL Server Express" as your edition. It limits each database to 10GB but is otherwise relatively unencumbered by MS licensing.
- Write down your SA password. You will need it later when connecting.
This is quite detailed, so head over to this link and follow the instructions in detail: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup?view=sql-server-ver15
Step 4: Update the settings of your virtual switch
The default setting inside Hyper-V is an "internal network" for your VM. This is fine if you are accessing your VM from the machine it's running on, but the whole point here is that you want it to be a "remote server". Set the virtual switch to "external network" and you can then access your VM from any machine on your network.
Again, MS has great documentation on this here: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/get-started/create-a-virtual-switch-for-hyper-v-virtual-machines
Step 5: Install Azure Data Studio on your dev machine - and write some SQL!
On your dev machine, make sure you can ping your VM. In my case, my VM is named "jacob-virtual-machine", so the command to validate I can reach it is:
ping jacob-virtual-machine
If you can't ping your VM, you have some networking issues to sort out. While I am no expert here, you will want to make sure you can see your VM outside the host (Step 4, above) and that port 1433 is open on the host and the VM.
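A successful ping only shows that the VM is visible on the network; it does not show that SQL Server's port is open. As a rough sketch (the function name `port_is_open` is my own, not part of any tool mentioned here), a few lines of standard-library Python can try a TCP connection to port 1433 from your dev machine:

```python
import socket


def port_is_open(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections, and timeouts.
        return False
```

Calling `port_is_open("jacob-virtual-machine")` (with your own VM name) should return True once the virtual switch and firewall are set up. If ping works but this returns False, the port or the SQL Server service is the more likely culprit.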
Once that is resolved, you can download and install Azure Data Studio [3]. Now, with the credentials from above and your VM name, you can connect to your remote server. Everything can be left on defaults, but for the avoidance of doubt, check out my connection settings below.
Now you have it all working and you have your own nice empty database to play with!
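If you would rather confirm the connection from a script than from a GUI, something like the following may work. It is only a sketch: it assumes the third-party pyodbc package and the Microsoft ODBC Driver 17 for SQL Server are installed on your dev machine, and the helper names `build_connection_string` and `server_version` are hypothetical.

```python
def build_connection_string(server: str, user: str, password: str,
                            database: str = "master", port: int = 1433,
                            driver: str = "ODBC Driver 17 for SQL Server") -> str:
    """Assemble an ODBC connection string for SQL Server."""
    return (
        f"DRIVER={{{driver}}};"       # driver name goes inside literal braces
        f"SERVER={server},{port};"    # SQL Server uses host,port (comma, not colon)
        f"DATABASE={database};"
        f"UID={user};"
        f"PWD={password}"
    )


def server_version(conn_str: str) -> str:
    """Connect and return the result of SELECT @@VERSION."""
    import pyodbc  # third-party, assumed installed: pip install pyodbc

    conn = pyodbc.connect(conn_str, timeout=5)
    try:
        return conn.cursor().execute("SELECT @@VERSION").fetchone()[0]
    finally:
        conn.close()
```

For example, `server_version(build_connection_string("jacob-virtual-machine", "sa", "<your SA password>"))` should return the SQL Server version banner. Passwords containing `;` or `}` need extra escaping that this sketch does not handle.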
Bonus Content: Connect dbt to SQL Server
For those of you wishing to use dbt with SQL Server, check out the dbt-sqlserver GitHub repository. It has great details, but I'll summarize the key bits.
You will need to install the dbt connector:
pip install dbt-sqlserver
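If you juggle several Python environments, it can be worth confirming that the connector landed in the one dbt runs from. A minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is my own:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Running `installed_version("dbt-sqlserver")` in the same environment should return a version string rather than None.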
I also find their explanation of the profiles.yml file kind of confusing, so I've included my own below for reference:
local_sql:
  target: dev
  outputs:
    dev:
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: <VM name>
      database: <database name>
      port: 1433
      schema: <schema name>
      user: <username>
      password: <password>
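Since indentation is what makes or breaks this file, here is a small sketch that renders the same profile from Python and writes it to a path you choose. The function names are hypothetical, and it assumes your values contain no characters that need YAML quoting (such as `:` or `#`). dbt looks for profiles.yml in the `.dbt` folder of your home directory by default.

```python
from pathlib import Path

# Same keys and nesting as the profile above; two spaces per level.
PROFILE_TEMPLATE = """\
local_sql:
  target: dev
  outputs:
    dev:
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: {server}
      database: {database}
      port: 1433
      schema: {schema}
      user: {user}
      password: {password}
"""


def render_profile(server: str, database: str, schema: str,
                   user: str, password: str) -> str:
    """Fill the template with connection details and return the YAML text."""
    return PROFILE_TEMPLATE.format(server=server, database=database,
                                   schema=schema, user=user, password=password)


def write_profile(text: str, path: Path) -> Path:
    """Write the profile to `path`, creating the parent folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path
```

For example, `write_profile(render_profile("jacob-virtual-machine", "<database name>", "dbo", "sa", "<password>"), Path.home() / ".dbt" / "profiles.yml")` would put the file where dbt expects it. Note that this overwrites any profiles.yml already at that path, so back up an existing file first.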
[1] You can also probably do this with WSL2, and not install a Linux VM. However, I am going to be running more software on the VM later and I want to split it to another machine. You can also use Docker on top of all of this, which I may cover in another post.
[2] I'm choosing SQL Server for a couple of reasons: I am familiar with it, and the documentation and community are large. PostgreSQL also works here, which has the advantage of having a default dbt connector.
[3] SSMS works here too, but Azure Data Studio has the advantage of being cross-platform. If you are using dbt, you need a SQL runner anyway, as the VS Code options aren't great.