Automating the provisioning of Active Directory labs in Azure

Today, I’m releasing Adaz, a project aimed at automating the provisioning of hunting-oriented Active Directory labs in Azure. This post is the making-of, where we walk through how to leverage Terraform and Ansible to spin up full-blown Active Directory environments with Windows Server 2019 and Windows 10 machines.


Introduction

Meet Adaz

After a few weeks of work, I’m happy to release Adaz, a project that lets you easily spin up Active Directory labs for hunting, testing threat detection methods, and more generally having fun with AD and Windows without the time cost that usually goes with it.

https://github.com/christophetd/Adaz

How simple is it, you ask?

$ terraform apply -var 'region=West US'

Wait 15-20 minutes for the magic to happen and that’s it! You have an Active Directory environment with:

  • A Windows Server 2019 DC
  • Domain-joined Windows 10 workstations
  • Sysmon, Windows Event Forwarding, Audit log policies pre-configured
  • A Kibana + Elasticsearch instance with your Windows logs ready to be queried
####################
###  WHAT NEXT?  ###
####################

Check out your logs in Kibana: 
http://52.188.70.68:5601

RDP to your domain controller: 
xfreerdp /v:52.188.70.141 /u:hunter.lab\\hunter '/p:Hunt3r123.' +clipboard /cert-ignore

RDP to a workstation:
xfreerdp /v:13.90.29.138 /u:localadmin '/p:Localadmin!' +clipboard /cert-ignore


workstations_public_ips = {
  "DANY-WKS" = "52.188.70.147"
  "XTOF-WKS" = "13.90.29.138"
}

Tooling landscape

Chris Long did an excellent job building DetectionLab, which is a similar and more complete project. It supports running on AWS and on VirtualBox/VMware Workstation, and can be provisioned either with Terraform or with Vagrant. It’s also impossible to talk about labs without mentioning HELK by Roberto Rodriguez.

So, you may ask, why are you rolling your own lab? Here are a few reasons:

  • It’s fun, and I learned quite a lot on Azure / Terraform / Ansible / Packer in the process
  • I have free credits on Azure – not on AWS
  • DetectionLab uses Splunk. I wanted to use Elasticsearch for potential future integration with tools like Elastalert and Sigma rules
  • DetectionLab doesn’t support having multiple Windows 10 workstations, making it harder to test things like lateral movement techniques
  • The HELK architecture is quite complex
  • HELK is based on Docker and hence doesn’t provide Windows 10/Windows Server VMs
  • My main use-case is being able to spin up a complete AD environment with basic hunting capabilities in a short amount of time, and destroying it a few hours later

Making of

The rest of this post is the making of Adaz and shows in detail how we can leverage Terraform and Ansible to build AD labs in Azure.

What do we want?

Before we get started, let’s define what we’d like to build. Since our goal is to mimic an enterprise-like Windows environment, we want at least:

  • A domain controller running Windows Server
  • Domain-joined workstation(s) running Windows 10

We would also like to leverage the lab to search through the Windows event logs. In order to do that, we’ll want:

  • Windows Event Forwarding, a built-in way to centralize log collection
  • An Elasticsearch instance to store the logs in
  • Kibana, to have a nice UI to run queries

Where do we want it?

We have several choices about where to run the lab.

  • Locally on our laptop, running in Virtualbox or VMWare Workstation
  • On an enterprise-ready virtualization platform such as ESXi, Proxmox, or Hyper-V
  • On a public cloud provider such as AWS, Azure or GCP

Although we all have different needs and use-cases, mine is to be able to spin up a lab for a few hours/days with minimal overhead and hardware, which excludes enterprise-like hypervisors. Next, I don’t want to be restricted by the RAM and especially by the disk space available on my machine, ruling out the first option.

That leaves us with the public cloud choice! How do we choose between AWS, Azure, and GCP? Easy one: AWS and GCP don’t provide the ability to run Windows 10 VMs*, so we’re left with Azure.

(* Technically, you can run Windows 10 VMs in AWS if you build your own AMI and bring your own Windows 10 license…)

You can create an Azure subscription here. Microsoft gives you $200 worth of credits for the first 30-day period. Alternatively, if your organization uses Microsoft technologies, there’s a chance they can give you a Visual Studio subscription which includes a monthly amount of Azure credits. Mine provided me with a Visual Studio Dev Essentials account which gives $50/month.

Tools and technologies

We’ll be using the following tools:

  • Terraform and its Azure provider to create resources in Azure: virtual machines, networks, disks, etc. Terraform allows us to nicely define the desired state of our infrastructure “as code” using a declarative rather than an imperative style. Instead of saying “create a VM for me“, we write “make sure a VM with these specifications exists” and let Terraform do the work.
  • Ansible to provision our virtual machines once they’re up and running. Ansible allows us to configure both Linux and Windows machines, using SSH and WinRM respectively to establish a remote connection. In particular, we’ll make heavy use of the Ansible modules for Windows.

Note: If you’re familiar with the DevOps tooling ecosystem, you’re probably wondering why we’re not throwing Packer in the mix. See “Why not use Packer?” below.

Configuring Terraform for Azure

The initial configuration is pretty straightforward:

  • Download the Azure CLI and run az login
  • Create a file provider.tf with the following code:
provider "azurerm" {
  # The version is constantly evolving, make sure to check https://github.com/terraform-providers/terraform-provider-azurerm/releases
  version = "=2.12.0"
  features {}
}
  • Run terraform init
  • You’re good to go!

Creating our domain controller

We’ll start with the creation of a Windows Server 2019 VM which we’ll use as a domain controller.

Prerequisite: networking

Before doing so, we need:

  • A resource group to place our resources into
  • A virtual network with a subnet
  • A network interface to attach to our DC. We’ll assign it both a private IP in our subnet and a public IP so we can easily access it remotely

Each of these directly maps to an Azure resource. Here’s how to define this with Terraform:

# Resource group
resource "azurerm_resource_group" "resourcegroup" {
    name     = "ad-lab-resource-group"
    location = "West Europe"
}

# Virtual network 10.0.0.0/16
resource "azurerm_virtual_network" "network" {
  name                = "virtual-network"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.resourcegroup.location
  resource_group_name = azurerm_resource_group.resourcegroup.name
}

# Subnet 10.0.0.0/24
resource "azurerm_subnet" "internal" {
  name                 = "subnet"
  resource_group_name  = azurerm_resource_group.resourcegroup.name
  virtual_network_name = azurerm_virtual_network.network.name
  address_prefixes     = ["10.0.0.0/24"]
}

# Public IP for the DC, so we can easily reach it remotely
resource "azurerm_public_ip" "domain_controller" {
  name                = "domain-controller-public-ip"
  location            = azurerm_resource_group.resourcegroup.location
  resource_group_name = azurerm_resource_group.resourcegroup.name
  allocation_method   = "Static"
}

# Network interface for the DC
resource "azurerm_network_interface" "dc_nic" {
  name                = "domain-controller-nic"
  location            = azurerm_resource_group.resourcegroup.location
  resource_group_name = azurerm_resource_group.resourcegroup.name

  ip_configuration {
    name                          = "static"
    subnet_id                     = azurerm_subnet.internal.id
    private_ip_address_allocation = "Static"
    private_ip_address            = cidrhost("10.0.0.0/24", 10)
    public_ip_address_id          = azurerm_public_ip.domain_controller.id
  }
}

We run terraform apply and let Terraform create the resources for us.

It would be nicer if Terraform printed the public IP it assigned to our network interface, so let’s add a Terraform output:

output "domain_controller_public_ip" {
  value = azurerm_public_ip.domain_controller.ip_address
}

When we terraform apply, we can now see the public IP in the console output.

Outputs:

domain_controller_public_ip = 13.94.215.14

If you’re thinking right now: “Is this guy really exposing a domain controller to the Internet? And he claims to work in security?!” – you’re totally right. Since this is a lab environment with (hopefully) no sensitive data in it, let’s consider that whitelisting our outgoing public IP is “good enough”.

In order to do this, we will create a Network Security Group and attach it to the network interface of our domain controller. We’ll allow RDP and WinRM traffic. Note that we could also attach the NSG directly to the subnet, and it would apply to any machine in that subnet.

# Note: you'll need to run 'terraform init' before terraform apply-ing this, because 'http' is a new provider

# Dynamically retrieve our public outgoing IP
data "http" "outgoing_ip" {
  url = "http://ipv4.icanhazip.com"
}
locals {
  outgoing_ip = chomp(data.http.outgoing_ip.body)
}

# Network security group
resource "azurerm_network_security_group" "domain_controller" {
  name                = "domain-controller-nsg"
  location            = azurerm_resource_group.resourcegroup.location
  resource_group_name = azurerm_resource_group.resourcegroup.name

  # RDP
  security_rule {
    name                       = "Allow-RDP"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "3389"
    source_address_prefix      = "${local.outgoing_ip}/32"
    destination_address_prefix = "*"
  }

  # WinRM
  security_rule {
    name                       = "Allow-WinRM"
    priority                   = 101
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "5985"
    source_address_prefix      = "${local.outgoing_ip}/32"
    destination_address_prefix = "*"
  }
}

# Associate our network security group with the NIC of our domain controller
resource "azurerm_network_interface_security_group_association" "domain_controller" {
  network_interface_id      = azurerm_network_interface.dc_nic.id
  network_security_group_id = azurerm_network_security_group.domain_controller.id
}

We’re good to go!

Creating a Windows Server 2019 VM

This page lists some commonly used Windows Server base images. We’ll use MicrosoftWindowsServer:WindowsServer:2019-Datacenter:latest (publisher:offer:sku:version). In addition, we set a randomly-generated initial administrator password and enable WinRM.
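
If you’d like to double-check the exact image reference yourself, the Azure CLI can list the available versions (this assumes az is installed and you’re logged in):

$ az vm image list --publisher MicrosoftWindowsServer --offer WindowsServer --sku 2019-Datacenter --all -o table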

# Note: you'll need to run 'terraform init' before terraform apply-ing this, because 'random_password' is a new provider
# Generate a Random password for our domain controller
resource "random_password" "domain_controller_password" {
  length = 16
}
# ... and make sure it's shown to us in the console output of 'terraform apply'
output "domain_controller_password" {
  value = random_password.domain_controller_password.result
}

# VM for our domain controller
resource "azurerm_virtual_machine" "domain_controller" {
  name                  = "domain-controller"
  location              = azurerm_resource_group.resourcegroup.location
  resource_group_name   = azurerm_resource_group.resourcegroup.name
  network_interface_ids = [azurerm_network_interface.dc_nic.id]
  # List of available sizes: https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs
  vm_size               = "Standard_D1_v2"

  # Base image
  storage_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2019-Datacenter"
    version   = "latest"
  }
  
  # Disk
  delete_os_disk_on_termination = true
  storage_os_disk {
    name              = "domain-controller-os-disk"
    create_option     = "FromImage"
  }

  os_profile {
    computer_name  = "DC-1"
    # Note: you can't use admin or Administrator in here, Azure won't allow you to do so :-)
    admin_username = "christophe"
    admin_password = random_password.domain_controller_password.result
  }
  os_profile_windows_config {
    # Enable WinRM - we'll need to later
    winrm {
      protocol = "HTTP"
    }
  }

  tags = {
    kind = "domain_controller"
  }
}

Now sit back and relax. The creation of a Windows virtual machine is quite slow and should take between 2 and 5 minutes. Once it’s done, we can RDP to the machine using its public IP and the randomly generated administrator password.

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

Outputs:

domain_controller_password = ALj}%ocW&y?uTWZG
domain_controller_public_ip = 13.94.215.14

Using xfreerdp, for instance:

xfreerdp /v:13.94.215.14 /u:christophe '/p:ALj}%ocW&y?uTWZG'  /w:1700 /h:1000 +clipboard

Setting up Ansible to configure the domain controller

Now that we have a domain controller up and running, we’d like to configure it. Ansible can provision Windows machines using WinRM, and we’ll use the Azure dynamic inventory plugin, which will allow us to automatically target our virtual machines without hardcoding any IP address in the Ansible configuration.

# group_vars/domain_controllers
ansible_connection: winrm
ansible_winrm_transport: ntlm
ansible_winrm_scheme: http
ansible_winrm_port: 5985
# inventory_azure_rm.yml
plugin: azure_rm
auth_source: cli
include_vm_resource_groups:
- ad-lab-resource-group
conditional_groups:
  # Place every VM with the tag "kind" == "domain_controller" in the "domain_controllers" Ansible host group
  domain_controllers: "tags.kind == 'domain_controller'"
# ansible.cfg
[defaults]
inventory=./inventory_azure_rm.yml
nocows=1
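
Before writing any playbook, it’s worth sanity-checking that the dynamic inventory actually picks up our VM and places it in the right group. With the configuration above, the output should look along these lines (the exact host suffix will differ):

$ ansible-inventory -i inventory_azure_rm.yml --graph
@all:
  |--@domain_controllers:
  |  |--domain-controller_7e4d
  |--@ungrouped: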

Let’s create a very basic playbook that creates a file C:\hello.txt to confirm Ansible is able to connect to our machine, and then run it.

---
- name: Configure domain controllers
  hosts: domain_controllers
  gather_facts: no
  tasks:
  - name: Create test file
    win_file:
      path: C:\hello.txt
      state: touch
$ ansible-playbook dc.yml --inventory inventory_azure_rm.yml \
  -e AZURE_RESOURCE_GROUPS=ad-lab-resource-group \
  --user christophe --ask-pass

PLAY [Configure domain controllers]

TASK [Create test file] 
changed: [domain-controller_7e4d]

PLAY RECAP
domain-controller_7e4d     : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Provisioning a Windows domain with Ansible

Now that we have our domain controller up and running, the next step is to create a Windows domain. Ansible offers modules precisely for this purpose:

---
- name: Configure domain controllers
  hosts: domain_controllers
  gather_facts: no
  vars:
    domain_name: christophe.lab
    domain_admin: christophe
    domain_admin_password: "{{ ansible_password }}"
    safe_mode_password: "This_should_have_been_a_randomly_generated_password:("

  tasks:
  - name: Ensure domain is created
    win_domain:
      dns_domain_name: "{{ domain_name }}"
      safe_mode_password: "{{ safe_mode_password }}"
    register: domain_creation

  - name: Reboot if domain was just created
    win_reboot: {}
    when: domain_creation.reboot_required

  - name: Ensure domain controllers are promoted
    win_domain_controller:
      dns_domain_name: "{{ domain_name }}"
      domain_admin_user: "{{ domain_admin }}@{{ domain_name }}"
      domain_admin_password: "{{ domain_admin_password }}"
      safe_mode_password: "{{ safe_mode_password }}"
      state: domain_controller
      log_path: C:\Windows\Temp\promotion.txt
    register: dc_promotion

  - name: Reboot if server was just promoted to a domain controller
    win_reboot: {}
    when: dc_promotion.reboot_required

This isn’t fast, because creating a domain and promoting a server to a domain controller are non-trivial operations and both require a full reboot. Expect it to take around 10 minutes.

Once this playbook has been run, note that you’ll need to specify the name of the domain (here, christophe.lab) when connecting via RDP, e.g.

xfreerdp /v:13.81.2.38 /u:christophe.lab\\christophe /p:LbeR5EHKJxXzAqaJ /cert-ignore

Now that we have a proper Windows domain and a domain controller, let’s add some domain-joined workstations!

Adding Windows 10 workstations

The creation of the Windows 10 VMs and NICs is very similar, so we won’t go into too much detail. The most challenging part for me was finding the right base image reference, the solution being to use az vm image list -f "Windows-10" --all to find the right image for Windows 10 Pro N.
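
Since that listing is quite long, piping it through a filter helps narrow it down to the Pro N SKUs (assuming a Unix-like shell with grep available):

$ az vm image list --publisher MicrosoftWindowsDesktop -f "Windows-10" --all -o table | grep -i pron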

Another interesting capability of Terraform that we can leverage is its ability to create multiple instances of a resource. This is useful for workstations because we’ll typically want several of them. Let’s store their number in a Terraform variable.

variable "num_workstations" {
  description = "Number of workstations to create"
  type = number
  default = 2
}

Let’s now create a network security group for our workstations, and a network interface with a public IP for each one of them.

# Network security group
resource "azurerm_network_security_group" "workstations" {
  name                = "workstations-nsg"
  # Rest is the same as for the domain controller NSG!
}

# Create 1 public IP per workstation
resource "azurerm_public_ip" "workstation" {
  count = var.num_workstations

  name                    = "workstation-${count.index + 1}-public-ip"
  location                = azurerm_resource_group.resourcegroup.location
  resource_group_name     = azurerm_resource_group.resourcegroup.name
  allocation_method       = "Static"
}

# Create 1 NIC per workstation
resource "azurerm_network_interface" "workstations_nic" {
  count = var.num_workstations

  name                = "workstation-${count.index + 1}-nic"
  location            = azurerm_resource_group.resourcegroup.location
  resource_group_name = azurerm_resource_group.resourcegroup.name

  ip_configuration {
    name                          = "static"
    subnet_id                     = azurerm_subnet.internal.id
    private_ip_address_allocation = "Static"
    private_ip_address            = cidrhost("10.0.0.128/25", 100 + count.index)
    public_ip_address_id          = azurerm_public_ip.workstation[count.index].id
  }
}

# Associate our network security group with the NIC of our workstations
resource "azurerm_network_interface_security_group_association" "workstations" {
  count = var.num_workstations

  network_interface_id      = azurerm_network_interface.workstations_nic[count.index].id
  network_security_group_id = azurerm_network_security_group.workstations.id
}

This will create the corresponding resources in Azure: one NSG for the workstations, plus one public IP and one NIC per workstation. The interesting bits are:

  • count = var.num_workstations: repeat the current resource var.num_workstations times
  • private_ip_address = cidrhost("10.0.0.128/25", 100 + count.index): using the cidrhost function, we assign the IP addresses 10.0.0.228, 10.0.0.229, 10.0.0.230… to the workstations (host numbers 100, 101, 102… within the 10.0.0.128/25 range), as the quick terraform console check below shows
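
If you want to convince yourself of what cidrhost returns, terraform console is handy:

$ terraform console
> cidrhost("10.0.0.128/25", 100)
10.0.0.228
> cidrhost("10.0.0.128/25", 101)
10.0.0.229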

Let’s now create our actual Windows 10 workstation VMs!

# Generate a random password and reuse it for each local admin account on workstations
resource "random_password" "workstations_local_admin_password" {
  length  = 16
  special = false
}

# Windows 10 workstations
resource "azurerm_virtual_machine" "workstation" {
  count = var.num_workstations
  
  name                  = "workstation-${count.index + 1}"
  location              = azurerm_resource_group.resourcegroup.location
  resource_group_name   = azurerm_resource_group.resourcegroup.name
  network_interface_ids = [azurerm_network_interface.workstations_nic[count.index].id]
  # List of available sizes: https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs
  vm_size               = "Standard_D1_v2"


  storage_image_reference {
    # az vm image list -f "Windows-10" --all
    publisher = "MicrosoftWindowsDesktop"
    offer     = "Windows-10"
    sku       = "19h1-pron"
    version   = "latest"
  }

  delete_os_disk_on_termination = true
  storage_os_disk {
    name              = "workstation-${count.index + 1}-os-disk"
    create_option     = "FromImage"
  }

  os_profile {
    computer_name  = "WORKSTATION-${count.index + 1}"
    admin_username = "localadmin"
    admin_password = random_password.workstations_local_admin_password.result
  }
  os_profile_windows_config {
      winrm {
        protocol = "HTTP"
      }
  }

  tags = {
    kind = "workstation"
  }
}

Just like before, we’d like to have some visibility on the public IPs and the local administrator password generated for us, so let’s create an output for these:

output "workstations_public_ips" {
  value = azurerm_public_ip.workstation.*.ip_address
}

output "workstations_local_admin_password" {
  value = random_password.workstations_local_admin_password.result
}
Outputs:

[...]
workstations_local_admin_password = "Uxae3qltIGyxg6WA"

workstations_public_ips = [
  "13.80.143.17",
  "52.166.236.117",
]

With a little extra effort, we can even output a ready-to-use command line to RDP in the workstations:

output "workstations_rdp_commandline" {
  value = {
    for i in range(var.num_workstations):
    "workstation-${i + 1}" => "xfreerdp /v:${azurerm_public_ip.workstation[i].ip_address} /u:localadmin /p:${random_password.workstations_local_admin_password.result}  /w:1100 /h:650 +clipboard /cert-ignore"
  }
}
Outputs:

[...]
workstations_rdp_commandline = {
  "workstation-1" = "xfreerdp /v:13.80.143.17 /u:localadmin /p:Uxae3qltIGyxg6WA  /w:1100 /h:650 +clipboard /cert-ignore"
  "workstation-2" = "xfreerdp /v:52.166.236.117 /u:localadmin /p:Uxae3qltIGyxg6WA  /w:1100 /h:650 +clipboard /cert-ignore"
}

Et voilà: after waiting for a few minutes, we have our workstations! In my experiments, spinning up the workstations consistently takes around 5 minutes. And since the creation of the VMs is parallelized by Terraform, it should take roughly the same amount of time whether we provision 1 or 10 workstations. We can also easily scale our lab up or down by changing the num_workstations variable:

$ terraform apply -var 'num_workstations=1'
# Scale up!
$ terraform apply -var 'num_workstations=5'
# Scale down and remove 2 workstations
$ terraform apply -var 'num_workstations=3'

Joining workstations to the domain with Ansible

Ansible has a handy module, win_domain_membership, for managing the domain membership status of a machine. All we therefore need to do is map our kind: workstation tag in the dynamic Ansible inventory and write the appropriate playbook to ensure workstations are domain-joined.

# inventory_azure_rm.yml
plugin: azure_rm
auth_source: cli
include_vm_resource_groups:
- ad-lab-resource-group
conditional_groups:
  # Place every VM with the tag "kind" == "domain_controller" in the "domain_controllers" Ansible host group
  domain_controllers: "tags.kind == 'domain_controller'"

  # Same for workstations 
  workstations: "tags.kind == 'workstation'"
# group_vars/workstations
ansible_connection: winrm
ansible_winrm_transport: ntlm
ansible_winrm_scheme: http
ansible_winrm_port: 5985
# workstations.yml
---
- name: Configure workstations
  hosts: workstations
  vars:
    # Note: these should ideally be placed into a shared variable file (such as group_vars/all)
    # to avoid duplication with the DC playbook
    domain_name: christophe.lab
    domain_admin: christophe
  vars_prompt:
  - name: domain_admin_password
    prompt: "Domain admin password"
  tasks:
  - name: Set DC as DNS server
    win_dns_client:
      adapter_names: '*'
      ipv4_addresses: "{{ hostvars[groups['domain_controllers'][0]].private_ipv4_addresses }}"

  - name: Ensure workstation is domain-joined
    win_domain_membership:
      dns_domain_name: "{{ domain_name }}"
      hostname: "{{ ansible_env.COMPUTERNAME }}"
      domain_admin_user: "{{ domain_admin }}@{{ domain_name }}"
      domain_admin_password: "{{ domain_admin_password }}"
      state: domain
    register: domain_state

  - name: Reboot machine if it has just joined the domain
    win_reboot: {}
    when: domain_state.reboot_required

Result:

$ ansible-playbook workstations.yml --inventory inventory_azure_rm.yml \
    -e AZURE_RESOURCE_GROUPS=ad-lab-resource-group \
    --user localadmin --ask-pass

SSH password:
Domain admin password:
                                           
PLAY [Configure workstations] 

TASK [Gathering Facts] 

ok: [workstation-2_2760]
ok: [workstation-1_3271]

TASK [Set DC as DNS server] 
ok: [workstation-1_3271]
ok: [workstation-2_2760]

TASK [Ensure workstation is domain-joined] 
changed: [workstation-1_3271]
changed: [workstation-2_2760]

TASK [Reboot machine if it has just joined the domain] 
changed: [workstation-1_3271]
changed: [workstation-2_2760]

PLAY RECAP 
workstation-1_3271         : ok=4    changed=2    ...
workstation-2_2760         : ok=4    changed=2    ...

Our workstations are now domain-joined.

Automating the provisioning of virtual machines after creation

The process is still a bit cumbersome: to provision our lab, we need to run terraform apply and then run both Ansible playbooks manually. We can make things smoother by using the local-exec Terraform provisioner, which lets us run a local command when a resource is created. In our case, we’ll leverage it to automatically run Ansible against our domain controller upon creation:

resource "azurerm_virtual_machine" "domain_controller" {
  name = "domain-controller"
  
  # ...

  provisioner "local-exec" {
    command = "ansible-playbook dc.yml --user christophe -e ansible_password=${random_password.domain_controller_password.result} -e AZURE_RESOURCE_GROUPS=${azurerm_resource_group.resourcegroup.name} -v"
  }
}

Note that the playbook will only be run when the resource (here, the domain controller VM) is created. You’ll still need to run it manually whenever you update the playbook afterwards.
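
For instance, after tweaking dc.yml on an already-provisioned lab, you can simply re-run it by hand the same way as earlier:

$ ansible-playbook dc.yml --inventory inventory_azure_rm.yml \
  -e AZURE_RESOURCE_GROUPS=ad-lab-resource-group \
  --user christophe --ask-pass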

For workstations, we need to be a bit smarter. If we run their Ansible playbook before the domain is properly created by the domain controller playbook, they will obviously fail to join it. We want the following sequence:

  • Create domain controller
  • Create workstations
  • Provision domain controller (including domain creation)
  • Provision workstations

We can use the Terraform null_resource to introduce a “fake resource” with an explicit dependency on the domain controller, and use the local-exec provisioner to run the workstations Ansible playbook only once the domain controller has been provisioned and the domain is ready:

resource "null_resource" "provision_workstations_once_dc_has_been_created" {
  # Note: the dependency on 'azurerm_virtual_machine.workstation' applies to *all* resources created from this block
  # The provisioner will only be run once all workstations have been created (not once per workstation)
  # c.f. https://github.com/hashicorp/terraform/issues/15285
  depends_on = [
    azurerm_virtual_machine.domain_controller, 
    azurerm_virtual_machine.workstation 
  ]

  provisioner "local-exec" {
    command = "ansible-playbook workstations.yml --user localadmin -e domain_admin_password=${random_password.domain_controller_password.result} -e ansible_password=${random_password.workstations_local_admin_password.result} -e AZURE_RESOURCE_GROUPS=${azurerm_resource_group.resourcegroup.name} -v"
  }
}

Once the DC has been created and provisioned and the workstation VMs have been created as well, this null resource will be “created” and its local-exec block will provision the workstations with our Ansible playbook.

Bringing Windows Event Forwarding into play

Windows Event Forwarding is a built-in mechanism for centralizing Windows logs on a single machine, often called the “WEC” (Windows Event Collector), “WEF collector”, or (incorrectly) “WEF”.

There are multiple ways to configure it, but it’s most often configured as follows:

  • On the WEC:
    • Enable the Windows Event Collector service (wecsvc)
    • Enable the ForwardedEvents event log
    • Create a subscription, essentially saying “I’m allowing machines X to send me their event logs Y and Z”
  • On WEF clients (e.g. workstations):
    • Enable WinRM
    • Instruct the machine to send its logs to the WEC

WEF clients are often configured via GPO, but it’s actually entirely possible (and more automation-friendly) to configure them via the registry. For the sake of conciseness (which is probably already ruined given the length of this blog post!), I won’t include the full Ansible code here and will instead let you take a look at the corresponding roles in the Adaz repository – a minimal sketch of the registry-based client configuration is shown below.

In our case, to avoid creating too many unnecessary VMs, our domain controller will act as a WEF collector.
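
As an illustration only (this is not the actual role from the repository), pointing a WEF client at the collector boils down to a single registry value. The sketch below assumes the DC/WEC is reachable as dc-1.christophe.lab:

# Sketch only: make workstations forward their logs to the WEC via the registry
- name: Configure workstations as WEF clients
  hosts: workstations
  tasks:
  - name: Point the machine at the Windows Event Collector (hostname is an assumption)
    win_regedit:
      path: HKLM:\SOFTWARE\Policies\Microsoft\Windows\EventLog\EventForwarding\SubscriptionManagers
      name: "1"
      data: "Server=http://dc-1.christophe.lab:5985/wsman/SubscriptionManager/WEC,Refresh=60"
      type: string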

Shipping logs to Elasticsearch with Winlogbeat

Once our Windows logs are on the WEC, we need to ship them somewhere. We’ll use Elasticsearch and Kibana since they’re free and easy to install. We’ll also install Elasticsearch and Kibana on the same Ubuntu VM to simplify things, and use Winlogbeat to send logs from the domain controller (acting as a WEF collector) to Elasticsearch.

The configuration is pretty straightforward:

winlogbeat.event_logs:
# Collected logs
- name: ForwardedEvents
  forwarded: true

# Logs of the domain controller itself
- name: Security
- name: Microsoft-Windows-Sysmon/Operational

output.elasticsearch:
  hosts:
  - {{ elasticsearch_ip }}:9200
  index: "winlogbeat-%{[agent.version]}-%{+yyyy.MM.dd}"

setup.template.name: "winlogbeat"
setup.template.pattern: "winlogbeat-*"

If you’re interested in more detail on how to install and configure Winlogbeat with Ansible, take a look at the winlogbeat role!
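
Once Winlogbeat is running, a quick sanity check from the Ubuntu VM (assuming Elasticsearch listens on its default port, 9200) is to confirm that winlogbeat-* indices are being created:

$ curl 'http://localhost:9200/_cat/indices/winlogbeat-*?v'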

Query me maybe: Kibana + Elasticsearch

Installing Elasticsearch with Ansible doesn’t pose any specific challenge, so we’ll skip straight over it – see elasticsearch.yml.

We need to be a little smarter for Kibana – when you set it up initially, it’s not configured out of the box and you need to set up your index patterns through the UI. An index pattern is what tells Kibana that your Windows logs are sitting in Elasticsearch indices of the form winlogbeat-*. Although Kibana doesn’t offer an option to configure this from its configuration file, we can do it using the saved objects API!

- name: List Kibana index patterns
  uri:
    url: http://127.0.0.1:5601/api/saved_objects/_find?fields=title&per_page=100&type=index-pattern
    return_content: yes
  register: index_patterns
  # Initially, need to wait a bit until the server is ready
  until: index_patterns.content != "Kibana server is not ready yet" and index_patterns.status == 200
  retries: 100
  delay: 5

- name: Create Kibana index pattern for winlogbeat
  uri:
    url: http://127.0.0.1:5601/api/saved_objects/index-pattern
    method: POST
    body: '{"attributes":{"title":"winlogbeat-*","timeFieldName":"@timestamp","fields":"[]" }}'
    body_format: json
    headers: {'kbn-xsrf': 'kibana'}
  when: index_patterns.json.saved_objects|length == 0 
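
To check the result by hand, the same saved objects API can be queried with curl from the Kibana host:

$ curl 'http://127.0.0.1:5601/api/saved_objects/_find?type=index-pattern&fields=title'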

Adaz: Wrapping up

I put together Adaz based on all the elements discussed in this post. It’s a bit better packaged and can be configured via a high-level YAML file, so you can easily customize the users, groups, workstations, and OUs of your lab:

dns_name: hunter.lab
dc_name: DC-1

initial_domain_admin:
  username: hunter
  password: MyAdDomain!

organizational_units: {}

users:
- username: christophe
- username: dany

groups:
- dn: CN=Hunters,CN=Users
  members: [christophe]

default_local_admin:
  username: localadmin
  password: Localadmin!

workstations:
- name: XTOF-WKS
  local_admins: [christophe]
- name: DANY-WKS
  local_admins: [dany]

enable_windows_firewall: yes

There is also plenty of documentation available including a FAQ! For any suggestion or comment, feel free to open an issue or to reach out on Twitter. 🙂

Bonus

Help! terraform destroy is not working

When destroying the lab, I’ve run into issues where terraform destroy would fail with a message similar to:

Error: Error waiting for update of Network Interface "workstation-2-nic" (Resource Group "ad-lab-resource-group"): Code="OperationNotAllowed" Message="Operation 'startTenantUpdate' is not allowed on VM 'workstation-2' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)." Details=[]

This seems to be a (somewhat) known issue, and while it’s apparently possible to fix it by specifying explicit dependencies between Terraform resources, I ended up finding it easier to simply remove the Azure resource group and remove the Terraform state file:

$ az group delete --yes -g ad-lab-resource-group
$ rm terraform.tfstate

Don’t expect it to be fast, though.

Keeping admin passwords out of the command line

With the approach we took to automatically provision new resources created by Terraform with Ansible, we are passing passwords on the command line:

provisioner "local-exec" {
  command = "ansible-playbook workstations.yml --user localadmin -e domain_admin_password=${random_password.domain_controller_password.result} -e ansible_password=${random_password.workstations_local_admin_password.result} -e AZURE_RESOURCE_GROUPS=${azurerm_resource_group.resourcegroup.name} -v"
}

One of the downsides is that this exposes the password on the machine running Terraform/Ansible:

$ ps aux | grep ansible-playbook                                                                                                                     
[...] ansible-playbook workstations.yml --user localadmin -e domain_admin_password=Y4Eeo1kTb8
F0OcNI -e ansible_password=uMZjoxm6bqR56AjB -e AZURE_RESOURCE_GROUPS=ad-lab-resource-group -v

While we don’t really care for a lab environment like ours, it’s interesting to think about how we could improve this.

It boils down to the following question: how can Terraform and Ansible “communicate” in a way that doesn’t expose passwords? HashiCorp Vault would be one way. Assuming we have a Vault instance readily available against which we’re already authenticated, we could use the Terraform Vault provider and its vault_generic_secret resource to store the randomly generated password. Once Ansible kicks in, it could then use the hashi_vault lookup plugin to pull this secret securely.
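
As a rough sketch of what that could look like – assuming the Vault provider is configured, a KV v1 secrets engine is mounted at secret/, an authenticated token is available on the machine running Terraform/Ansible, and the secret path is made up for the example – the Terraform side could write the generated password to Vault:

# Sketch: store the generated workstation password in Vault instead of passing it on the command line
resource "vault_generic_secret" "workstations_local_admin" {
  path = "secret/adaz/workstations" # hypothetical path
  data_json = jsonencode({
    password = random_password.workstations_local_admin_password.result
  })
}

On the Ansible side, group_vars/workstations could then pull it back with the hashi_vault lookup instead of relying on -e ansible_password=... (the Vault URL below is a placeholder):

# group_vars/workstations (sketch)
ansible_password: "{{ lookup('hashi_vault', 'secret=secret/adaz/workstations:password url=https://vault.example.com:8200') }}"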

Why not use Packer?

If you’re into the DevOps ecosystem, you might have noticed that the way we’ve been provisioning our Windows machines all along is kind of an anti-pattern and doesn’t follow the immutable infrastructure principle. The complexity of our post-resource creation provisioning is high, meaning that there is a non-negligible risk something will go wrong.

The “right” way to build our lab should in theory look something like:

  • Use Packer to generate an Azure base image of our DC and workstations, already provisioned by Ansible
  • When we want to spin up the lab, use Terraform to instantiate it based on the base images

Packer is an awesome tool and its Azure Resource Manager builder works (out of the box) as follows:

  • Create a temporary resource group
  • Spin up a temporary VM
  • Run our provisioning steps against it (e.g. Ansible)
  • Shut down the VM
  • Convert the VM disk into a disk image
  • Remove all resources in the resource group
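
For reference, a minimal azure-arm template exercising this flow might look like the sketch below. This is only an illustration: the image and resource group names are placeholders, and it assumes authentication through the Azure CLI.

{
  "builders": [{
    "type": "azure-arm",
    "use_azure_cli_auth": true,
    "managed_image_resource_group_name": "packer-images",
    "managed_image_name": "adaz-domain-controller-base",
    "os_type": "Windows",
    "image_publisher": "MicrosoftWindowsServer",
    "image_offer": "WindowsServer",
    "image_sku": "2019-Datacenter",
    "location": "West Europe",
    "vm_size": "Standard_D1_v2",
    "communicator": "winrm",
    "winrm_use_ssl": true,
    "winrm_insecure": true,
    "winrm_timeout": "10m",
    "winrm_username": "packer"
  }],
  "provisioners": [{
    "type": "ansible",
    "playbook_file": "dc.yml"
  }]
}

In practice, the Ansible provisioner needs a few extra connection variables to talk to Windows over WinRM, and a final Sysprep/generalization step is required before the image is captured – which is part of what makes the process slow.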

The major downside of this approach is that it is slow. First, Azure is slow. Second, since we’re building a base image, it needs to be generalized using Sysprep, which is quite slow as well. From my tests, the process takes between 20 and 35 minutes to fully complete… while provisioning the full lab from scratch takes between 15 and 20 minutes. Admittedly, starting from a base image does make instantiation faster later on, but it also raises the question of what should be included in the base image (much harder to change) and what should be configured once resources have been instantiated in Azure.

In the context of a lab, it’s also arguable whether it makes sense to have to re-build our base image every time we want to perform a non-trivial change in our infrastructure or domain configuration.

This is why the project doesn’t include Packer resources for now, but please go vote on this issue with a thumbs up if you believe they would be valuable to have!

7 thoughts on “Automating the provisioning of Active Directory labs in Azure”

  2. So close to being perfect! A few comments:
    1. You can’t (or rather I can’t) seem to specify a gen2 VM in Terraform. That’s a large security flaw.
    2. PoSH DSC would have been a much nicer VM config tool
    3. Azure best practice is to assign the NSG to the workload subnet, not the machine directly
    4. No Azure Firewall (much better than NSG)
    5. I know you said about RDP Security but Bastion would have made readers feel a lot more comfortable with this connectivity approach, lab or not.

    • Hi Mark,

      Thanks for the feedback!

      1) Point taken, I’ll open an issue. It definitely makes sense to extract the VM size in a Terraform variable
      2) Not convinced about that, I’ve had some feedback that PoSH DSC was a bit flaky. To be honest, the main reason why I used Ansible and not DSC is that I’m much more familiar with Ansible
      3) Indeed, but for now I wanted the flexibility to specify different FW rules for every machine. I’ll probably use a subnet-wide NSG in the future though
      4) It’s also much more expensive… $1.25/hour on a <$0.5/hour lab seems hard to justify
      5) True, but it also restricts you to using RDP or SSH only - here, I’m hoping to work on some integration with Sigma/Elastalert and I’m also exposing Kibana, so I’m not sure it would make sense to put everything behind an Azure Bastion. Another idea would be to put everything behind a generic Linux box and access the lab through an SSH tunnel; that’s entirely possible but I fear it would unnecessarily increase complexity

      • The design choices as expressed in this comment make sense. The Linux box for SSH tunnel would be a nice to have, a bash or python glueware script could help the inexperienced users configure the tunnel.

  3. When creating the association between NSG and Domain controller NIC, please correct the resource name.
    ===================
    resource "azurerm_network_interface_security_group_association" "domain_controller" {
    network_interface_id = azurerm_network_interface.domain_controller_nic.id
    network_security_group_id = azurerm_network_security_group.domain_controller.id
    }
    ===========
    Here it should be azurerm_network_interface.dc_nic.id

    Thanks

    • Corrected, thanks!

  4. wow, your blog is amazing….
    keep it up.
    if you can, please do more CTF write-ups.
