The bridge between existing data platforms and GenAI

With the rise of Generative AI, organizations are seeking to bridge the gap between traditional data storage methods and the innovative capabilities that AI offers. To help with that challenge, we have developed a framework which efficiently executes GenAI workloads as part of orchestrated pipelines.

October 29, 2024
Dr. Mazidian has 10+ years of experience with data and holds a PhD in Theoretical Physics.
3 min read

With the rise of Generative AI (GenAI), organizations are seeking to bridge the gap between traditional data storage methods and the innovative capabilities offered by GenAI. This transformation involves integrating GenAI with existing data storage and analytics systems without requiring a complete overhaul of infrastructure. By utilizing methods like Retrieval-Augmented Generation (RAG) and frameworks such as LangChain, companies can enrich their data pipelines with GenAI, leading to more sophisticated systems that deliver richer insights from the data they already hold.

Many companies currently rely on Big Data Warehouses, such as Google Cloud Platform’s (GCP) BigQuery, for their advanced analytics needs, rather than the Vector Databases whose popularity has largely arisen with GenAI. Additionally, their workflows are often batch-oriented rather than real-time. This can present a challenge to GenAI adoption, which is frequently built around real-time data retrieval and analysis: LLM-agent patterns, for instance, rely on RAG to pull relevant information in real time, leveraging Vector DBs and embedding models. In environments that aren’t yet adapted to these new tools, the challenge is to unlock the potential of existing data using familiar data manipulation languages such as SQL, alongside orchestrators like Cloud Composer (managed Apache Airflow), to pre-process and execute GenAI workloads efficiently.

At Beyond, working with one of our customers, we have developed a framework that efficiently executes GenAI workloads as part of orchestrated pipelines, such as those in Cloud Composer (Airflow), using the data which already exists in BigQuery. This means our customers can enhance their existing data with GenAI models, such as Vertex AI’s Gemini, by using SQL in place of vector embeddings to assemble context, and pipeline orchestrators to pre-compute large batches of completions rather than responding to new prompts in real time.
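To make the shape of such a pipeline concrete, here is a minimal sketch of an Airflow DAG that assembles context with SQL in BigQuery and hands the rows to a batch inference step. The dataset, table, SQL, and the run_batch_inference helper are illustrative placeholders, not our production code.

```python
# Illustrative Airflow DAG: SQL assembles the context in BigQuery, then a
# Python task pre-computes a batch of GenAI completions over the results.
# Dataset/table names and run_batch_inference are hypothetical placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from google.cloud import bigquery


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def genai_batch_pipeline():

    @task
    def fetch_context() -> list[dict]:
        # SQL replaces vector retrieval: the join/aggregation already encodes
        # which documents belong to which entity.
        client = bigquery.Client()
        rows = client.query(
            """
            SELECT entity_id, ARRAY_AGG(document_text) AS documents
            FROM `project.dataset.documents`
            GROUP BY entity_id
            """
        ).result()
        return [dict(row) for row in rows]

    @task
    def run_inference(batches: list[dict]) -> None:
        # Hypothetical helper that runs the batched completions and writes
        # the results back to BigQuery.
        from my_framework.inference import run_batch_inference  # placeholder
        run_batch_inference(batches)

    run_inference(fetch_context())


genai_batch_pipeline()
```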

The necessary components to build a bridge

A general inference pattern

LangChain’s StuffDocumentsChain and ReduceDocumentsChain are especially useful for defining a create-once-use-many inference class: they handle large sets of documents efficiently by breaking them into manageable chunks and then reducing them into a coherent response. Because the relationships between records in our Big Data Warehouse are already known, we don’t need any embeddings to attach the relevant data to these chains. Although these chains have restrictions in terms of input cardinality, we have been able to integrate every use-case requested with a single executor (sketched below) by:

  • Intelligently engineering prompts to encapsulate equivalent context.
  • Engineering process flows to account for token limits.
  • Providing comprehensive use-case configuration specifying chain types, documents per chain and other essential use-case metadata.

[Figure: technical diagram showing the connection between documents and a natural language API]
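As an illustration of the create-once-use-many idea, the sketch below wires a Gemini model into a StuffDocumentsChain driven by per-use-case configuration. The exact import paths and constructor arguments vary between LangChain versions, and the prompt, model name, and configuration values are invented for the example.

```python
# Illustrative, config-driven "stuff" chain; import paths and arguments may
# differ across LangChain versions.
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI

# Per-use-case configuration: chain type, documents per call, prompt, etc.
USE_CASE_CONFIG = {
    "product_summaries": {
        "chain_type": "stuff",
        "docs_per_chain": 20,
        "prompt": "Summarise the following product reviews:\n\n{context}",
    }
}


def build_chain(use_case: str) -> StuffDocumentsChain:
    cfg = USE_CASE_CONFIG[use_case]
    llm = ChatVertexAI(model_name="gemini-1.5-pro")  # model name illustrative
    llm_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(cfg["prompt"]))
    # The chain is built once per use case and reused for every batch of documents.
    return StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="context")
```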

Asynchronous processing

Implementing asynchronous processing ensures that data retrieval and GenAI workload execution do not block the overall pipeline. By handling these tasks concurrently, we can maximize resource utilization and minimize latency, allowing for faster overall throughput.
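One way to achieve this, sketched below under the assumption that the chain exposes LangChain’s async invocation, is to fan the per-batch calls out with asyncio while a semaphore bounds concurrency; the batch structure and concurrency limit are placeholders.

```python
# Illustrative asynchronous fan-out: run many chain invocations concurrently
# while a semaphore keeps the number of in-flight model calls bounded.
import asyncio


async def run_batch(chain, batches: list[dict], max_concurrency: int = 8) -> list:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch: dict):
        async with semaphore:
            # LangChain chains expose an async invoke (ainvoke); for a
            # StuffDocumentsChain the input key is "input_documents" and the
            # values are expected to be LangChain Document objects.
            return await chain.ainvoke({"input_documents": batch["documents"]})

    # gather keeps the pipeline from blocking on any single slow completion.
    return await asyncio.gather(*(run_one(b) for b in batches))
```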

Output parsing

This component focuses on transforming the raw outputs generated by the GenAI models into structured and actionable insights. By defining clear parsing rules, we can ensure that the outputs are usable for downstream applications and align with business needs.
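For example, if a use case asks the model to return JSON, a parsing step such as the sketch below can coerce the raw completion into a typed record before it is written back to the warehouse; the schema and field names are assumptions for illustration.

```python
# Illustrative output parser: validate a raw model completion against a
# Pydantic schema before loading it downstream. Field names are examples only.
import json

from pydantic import BaseModel, ValidationError


class Insight(BaseModel):
    entity_id: str
    summary: str
    sentiment: str


def parse_completion(raw: str) -> Insight | None:
    # Models sometimes wrap JSON in markdown fences; strip them defensively.
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    try:
        return Insight(**json.loads(cleaned))
    except (json.JSONDecodeError, ValidationError):
        # Leave the row for the retry / post-evaluation steps to handle.
        return None
```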

Retry mechanism

A robust retry mechanism is essential for handling transient errors and ensuring the reliability of the pipeline. This allows the system to automatically retry failed tasks a predefined number of times, thus enhancing fault tolerance and ensuring that data processing is resilient against temporary issues.
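A simple way to express this, sketched below, is exponential backoff with a bounded number of attempts around the model call; the retriable exception types depend on the client library in use, so a generic Exception is caught here purely for illustration.

```python
# Illustrative retry wrapper: exponential backoff with jitter around a
# flaky model call.
import random
import time


def with_retries(call, max_attempts: int = 3, base_delay: float = 2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:  # in practice, catch the client's transient errors
            if attempt == max_attempts:
                raise
            # Back off exponentially, with jitter to avoid synchronized retries.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```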

Post-evaluation

After the data has been processed and outputs generated, post-evaluation steps are necessary to assess the quality and relevance of the results. This could involve validating outputs against predefined criteria or user feedback, ensuring that the generated insights meet the expected standards before they are utilized in decision-making processes.
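As a minimal illustration, a post-evaluation step might apply rule-based checks to each parsed output and route failing rows to a review table; the checks, fields, and thresholds below are assumptions, not our production criteria.

```python
# Illustrative post-evaluation: simple rule-based checks that gate which
# generated insights are allowed into downstream tables.
def evaluate(insight: dict) -> tuple[bool, list[str]]:
    issues: list[str] = []
    if len(insight.get("summary", "")) < 40:
        issues.append("summary too short")
    if insight.get("sentiment") not in {"positive", "neutral", "negative"}:
        issues.append("unexpected sentiment label")
    # Rows that fail any check can be sent for review rather than used
    # directly in decision-making.
    return (not issues, issues)
```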

Using these methods, we are able to serve every customer use-case within a single framework and a single executor, reducing execution time by a factor of 10 without pushing the limits of network capacity. This structured approach enables organizations to effectively bridge the gap between traditional data storage methods and the innovative capabilities offered by GenAI, maximizing the value derived from their existing data assets.


If you would like to know more about enhancing your existing data assets with the power of GenAI, please contact us.