Leading digital advertising agency advances data analytics using Vertica on AWS cloud

ABOUT THE CLIENT 

Integral Ad Science (IAS) is a global market leader in digital ad verification, offering technologies that drive high-quality advertising media. IAS equips advertisers and publishers with the insight and technology to protect advertising investments from fraud and unsafe environments as well as to capture consumer attention and deliver business outcomes. Founded in 2009, IAS is headquartered in New York, with global operations in 18 offices across 13 countries.

THE CHALLENGE

The client wanted to use their business intelligence tool to query their historical data for 90 days in under two seconds with more than 80 concurrent users. Their current system included data in Hadoop clusters and Infobright DB on-premise cluster, which was unable to handle their data analytics requirements.

THE SOLUTION

Beyondsoft’s Big Data consulting team proposed a solution using Vertica, a columnar database on Amazon Web Services (AWS). The AWS cloud solution would provide the added scalability, elasticity, and performance that the customer wanted. The project consisted of creating Vertica clusters in a repeatable manner and a pipeline-based approach for Vertica DDL. It also included moving large amounts of data daily to the Vertica cluster.

The project consisted of three phases:

  • Phase 1 involved creating a repeatable deployment process through infrastructure as code (Terraform) for Vertica cluster. The Vertica AMI was procured from AWS Marketplace. Beyondsoft engineers added the ability to launch different Vertica nodes as part of a cluster through tags to AWS Elastic Load Balancer (ELB). The Vertica cluster consist of two AZ’s in active/active nodes, with 16 nodes total and 90 days of data, which came to approximately 40TB. AWS Systems Manager Service (SSM) and CloudWatch Logs are used for administration of the cluster. This Vertica infrastructure as code also integrates with the customer’s self-servicing tool for their developers.
  • Phase 2 included a pipeline-based approach for Vertica DDL. Vertica DDL is pushed through the pipeline using Liquibase, a java framework for database change and deployment. This ensures that production Vertica clusters are not touched manually for schema changes.
  • Phase 3 involved setting up ETL from the Hadoop cluster to Vertica using a producer/consumer pattern. The ETL code is written in python code with on-demand Fargate containers, which extract data from Hadoop and store it in zipped files in S3. From there, jobs are created to load data into Vertica from S3. The data is around 120GB/day with around 570M rows loaded at its peak. The customer-facing java application has several dashboards which are able to procure data from Vertica in under two seconds query time with concurrent usage.
  • TECHNOLOGIES USED

    Vertica on AWS, AWS SSM, AWS CloudWatch logs, AWS S3, AWS ELB, AWS Fargate, AWS Parameter Store, AWS ECR, Python, Jenkins, etc.

    KNOWLEDGE TRANSFER

    Beyondsoft educated the client’s data analytics team around the newly created solution and Terraform and provided a runbook, to enable them to both manage and add to the solution in future. Beyondsoft also provided education on the various AWS services and customized training sessions on various topics.

    BENEFITS

    Moving from an on-premise cluster to the cloud increased the scalability, agility, and performance of the whole solution. Taking a DevOps approach through data pipelines decreased go-to-market time for code changes. Infrastructure as code provided a repeatable way to create infrastructure, increasing operational consistency and reducing bugs.

    Share on linkedin
    LinkedIn
    Share on email
    Email
    Share on print
    Print