Milestones

VDK Feature Proposals
Put feature proposal tickets here
No due date
•0/1 issues closed
0% complete 1 open 0 closed
VDK on AWS
No due date
•7/13 issues closed
53% complete 6 open 7 closed
Multiple Databases In One Job
No due date
•22/27 issues closed
81% complete 5 open 22 closed
multi-pod data-job instance
This should be completed in a separate branch. Once every ticket in this milestone is complete and all edge cases and nuanced behaviour are understood, it can be merged into main.
No due date
•0/7 issues closed
0% complete 7 open 0 closed
PAIH Demo
No due date
•6/6 issues closed
100% complete 0 open 6 closed
Google Colab Notebooks for VDK Examples
**What is the feature request? What problem does it solve?**

With the recent introduction of features supporting notebook integration in VDK (see [examples here](https://github.com/vmware/versatile-data-kit/wiki/Examples#jupyter-tutorials)), there's a significant opportunity to improve user engagement and ease of use. The idea is to transition from local, Python file-based examples to interactive Google Colab notebook tutorials. This approach will not only simplify the initial setup (eliminating the need for local installation) but also provide a more comprehensive and guided learning experience through the use of markdown and images.

**Suggested solution**

- Transform current examples, which are predominantly local and Python-file based, into Google Colab notebooks.
- Utilize the notebook format to create step-by-step tutorials, integrating explanations, code, and visual aids for a more immersive learning experience.
- Ensure that these notebooks are easily accessible and runnable without requiring local environment setup, thus lowering the entry barrier for new users.
- Follow the recommended tutorial guidelines as outlined in [VMware VDK Tutorial Guidelines](https://github.com/vmware/versatile-data-kit/wiki/Tutorial-Guidelines) to maintain consistency and quality.

A good existing Google Colab-based tutorial to use as a template is https://bit.ly/vdk-ingest (though it is longer because it is used for workshops; examples should be shorter).

Good starting points for getting introduced to the Jupyter integrations are:

- https://colab.research.google.com/drive/16pBJQePbqkz3QFV54L4NIkOn1kwpuRrj
- https://github.com/vmware/versatile-data-kit/wiki/Create-a-Data-Job-through-the-Jupyter-UI
- https://github.com/vmware/versatile-data-kit/wiki/Develop-a-Data-Job-through-the-Jupyter-UI
- https://github.com/vmware/versatile-data-kit/wiki/Convert-Data-Job-to-Jupyter-Notebook
- also https://bit.ly/vdk-ingest
No due date
•0/4 issues closed
0% complete 4 open 0 closed
vec
No due date
0% complete 0 open 0 closed
VDK Run Logs: Post Usability Testing
No due date
•3/4 issues closed
75% complete 1 open 3 closed
Private AI: Dataset creation
Last year Antoni did a proof of concept on generating datasets that can be used to train LLMs. The goal of this milestone is to take that idea and make it production ready. After this milestone is complete, we should have examples of creating LLM datasets from 2 different organizational data sources (e.g. GitLab and Confluence, or GitLab and Jira). We should have examples of saving LLM datasets to 2 different dataset storage repositories (e.g. the Hugging Face dataset registry and a persistent attached volume on k8s). We should also be able to show that it is somewhat configurable to meet customers' changing needs.
No due date
•0/1 issues closed
0% complete 1 open 0 closed
Private AI: Vector database Ingestion
**VEP**: https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-milestone-25-vector-database-ingestion

With the rise in popularity of LLMs and RAG, we see VDK as a core component in getting the data where we need it to be.

![image](https://github.com/vmware/versatile-data-kit/assets/2536458/5ee65fdb-fa63-4dc8-b9c1-a6a6bb5a19f3)

### Example problem scenario:

A company has a powerful private LLM chatbot. However, they want it to be able to answer questions using the latest version of Confluence docs, Jira tickets, etc. Retraining every night on the latest tickets/docs is not feasible. Instead they opt to use RAG to improve the chatbot responses. This leaves them with the question: how do we populate the data?

Steps they need to complete:

1. Read data from Confluence/Jira.
2. Chunk into paragraphs (or something similar).
3. Embed into vector space.
4. Save the vector and paragraph in the vector database.
5. Remove old information. For example, if we are scraping Jira every hour and writing details to the vector database, we need to make sure we clean up all embeddings/chunks which were generated from old versions of the ticket.

### Our goal

We want to template this. We will build a data job in VDK which reads data from Confluence or Jira and writes it to a DSM Postgres instance with PGVector enabled. An embedding model will be running on a different machine and will be exposed through an API. We will make requests to the API to create embeddings for us. After this data job is running, we will create a template from it which we think customers will be able to adapt to meet their use cases.

### Proposed database table solution

| embedding | text chunk | document id |
| --- | --- | --- |
| [1,2,3,4,5,6] | in this document blah... | 15 |

### Prerequisite reading:

- Langchain (https://python.langchain.com/docs/expression_language/cookbook/retrieval)

#### Learning materials

- https://www.freecodecamp.org/news/vector-search-and-rag-tutorial-using-llms-with-your-data/
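As a rough illustration of steps 2 to 5 in the scenario above, the following minimal sketch chunks a document, embeds each chunk, and replaces any rows left over from an older version of the same document. Everything here is an assumption made for illustration: the in-memory `table` list stands in for the PGVector table, `fake_embed` is a toy placeholder for the embedding API, and none of the function names come from VDK.

```python
import hashlib
from typing import Callable, List, Tuple

# Hypothetical in-memory stand-in for a table with the columns
# (embedding, text chunk, document id); a real job would write to PGVector.
Row = Tuple[List[float], str, int]


def chunk_paragraphs(text: str) -> List[str]:
    """Step 2: split a document into non-empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fake_embed(chunk: str) -> List[float]:
    """Step 3 placeholder: a deterministic toy vector, NOT a real embedding.
    In the proposal this would be a request to the embedding API."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


def ingest(table: List[Row], document_id: int, text: str,
           embed: Callable[[str], List[float]] = fake_embed) -> None:
    """Steps 4 and 5: drop rows from older versions, then store new rows."""
    table[:] = [row for row in table if row[2] != document_id]
    for chunk in chunk_paragraphs(text):
        table.append((embed(chunk), chunk, document_id))


table: List[Row] = []
ingest(table, 15, "First version.\n\nSecond paragraph.")
ingest(table, 15, "Edited ticket text.")  # a re-scrape replaces the old chunks
ingest(table, 16, "Another document.")
```

Deleting by document id before inserting keeps the cleanup simple; a real job would likely do both in one database transaction so readers never see a half-updated document.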
No due date
•21/30 issues closed
70% complete 9 open 21 closed
VDK Oracle
No due date
•17/22 issues closed
77% complete 5 open 17 closed
Unified Configuration
Currently if you want to configure VDK you have to do it differently depending on what is installed.

- vdk-control-cli is configured using environment variables and the ~/.vdk/config file.
- vdk-core uses [ConfigurationBuilder](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-core/src/vdk/internal/core/config.py#L137) and the vdk_configure plugin to both define and set values. Values are set using the [job config ini or env variables plugin](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-core/src/vdk/internal/builtin_plugins/config/vdk_config.py#L147).
- Control Service accepts configuration using the vkd-options.ini helm chart option, which sets the correct environment variables.
- Some older options are provided in specific sections, while most options are in the [vdk] section of config.ini. See [example](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-control-cli/src/vdk/internal/control/job/sample_job/config.ini#L4).
- The current configuration providers (ini files) don't support data types (only string).

---

The solution should be to unify this configuration and provide consistent configuration management in an automated way. For example:

- Move [vdk-config folder](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-control-cli/src/vdk/internal/control/configuration/vdk_config.py#L144) logic to a common place (likely vdk-core).
- Expand the logic to allow setting a global config file which can be overridden by more local ones (per team or per directory? Perhaps it can look for a ~/.vdk-internal folder in each directory until the home dir and take the first it finds? For each config option?). See [here](https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#provide-machine-level-global-settings).
- Allow users to set "sensitive" VDK configuration (like impala password).
- Allow [vdk options in jupyter settings](https://jupyterlab.readthedocs.io/en/stable/user/directories.html#jupyterlab-user-settings-directory).
- Provide TOML or YAML configuration formats.
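The "look in each directory until the home dir and take the first it finds" idea could be sketched roughly as follows. This is only one possible reading of the proposal: the function name `find_config_dir` and the default folder name `.vdk-internal` are assumptions for illustration, not existing VDK behaviour.

```python
from pathlib import Path
from typing import Optional


def find_config_dir(start: Path, home: Path,
                    name: str = ".vdk-internal") -> Optional[Path]:
    """Return the first `name` folder found walking from `start` up to `home`.

    The most local folder wins, so a per-directory or per-team folder
    overrides a global one placed in the home directory.
    """
    start, home = start.resolve(), home.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_dir():
            return candidate
        if directory == home:
            break  # do not search above the home directory
    return None
```

A call such as `find_config_dir(Path.cwd(), Path.home())` would then return the closest configuration folder, or `None` if there is none. If `start` is not located under `home`, this sketch keeps walking up to the filesystem root.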
No due date
0% complete 0 open 0 closed
Guided workflow with VDK Blueprints
When someone is new to a framework, the learning curve can often be steep. Offering a guided workflow can simplify this experience. By implementing an interactive CLI or notebook-based wizard that helps users set up their data jobs, source and destination configurations, etc., you can make it easier for new users to understand the framework's capabilities quickly. This can significantly reduce the time it takes to set up an initial PoC and evaluate VDK. See https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#guided-workflow--wizard-assistant-1
No due date
•0/6 issues closed
0% complete 6 open 0 closed
Python Files Configuration
As a data engineer using VDK, I want to define configuration using Python classes so that I can benefit from autocompletion and type checking.

- For each configuration class, there should be autocompletion, type checking, and tooltips in supported IDEs.
- Deprecated options should display warnings when used.
- The plugin developed for handling Python configurations should be compatible with VDK and should support all aforementioned features.

Non goal:

- Configuration files must be environment-specific and selectable via CLI.
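To make the user story more concrete, here is a minimal sketch of what a typed configuration class with a deprecated option might look like. The class `JobConfig` and all of its field names are hypothetical and chosen only for illustration; the actual plugin design is not defined in this milestone.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobConfig:
    """Hypothetical configuration class; field names are illustrative only."""

    team: str
    db_default_type: str = "sqlite"
    log_level: str = "INFO"
    # Deprecated option kept for backwards compatibility.
    old_log_level: Optional[str] = None

    def __post_init__(self) -> None:
        if self.old_log_level is not None:
            # Warn at the place where the config object was created.
            warnings.warn(
                "old_log_level is deprecated; use log_level instead",
                DeprecationWarning,
                stacklevel=3,
            )
            self.log_level = self.old_log_level


config = JobConfig(team="my-team", log_level="DEBUG")
```

Because the fields are annotated, IDEs and type checkers can offer autocompletion and flag a call like `JobConfig(team=42)`, while the `DeprecationWarning` covers the "deprecated options should display warnings" requirement.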
No due date
•0/3 issues closed
0% complete 3 open 0 closed
Existing tutorials follow Tutorial Guidelines
We have established guidelines for a good tutorial (or example) in https://github.com/vmware/versatile-data-kit/wiki/Tutorial-Guidelines. Most of the tutorials/examples were written before that, so they do not follow them very well. We should refactor them. I think this likely means splitting some tutorials, adding more visualization, improving structure, and testing them with real users.
No due date
0% complete 0 open 0 closed
VDK Run Logs: Promotional Materials
No due date
0% complete 0 open 0 closed
VDK Run Logs: Documentation
No due date
0% complete 0 open 0 closed
VDK Run Logs: Progress Indicators
Introduce progress indicators to data jobs running locally. Redirect log output to temp file.
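One way to picture this is a runner that sends log records to a temporary file while only a short progress line is shown on the console. The sketch below uses the standard `logging` and `tempfile` modules; the function name `run_steps_with_progress` and the logger name are assumptions for illustration and do not describe how VDK implements it.

```python
import logging
import sys
import tempfile


def run_steps_with_progress(steps, stream=sys.stdout):
    """Run callables in order, logging to a temp file and printing progress."""
    log_file = tempfile.NamedTemporaryFile(
        mode="w", suffix=".log", prefix="job-run-", delete=False
    )
    log_file.close()
    handler = logging.FileHandler(log_file.name)
    logger = logging.getLogger("job_run_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep log lines off the console
    logger.addHandler(handler)
    try:
        for index, step in enumerate(steps, start=1):
            logger.info("starting step %d", index)
            step()
            # Overwrite the same console line with the latest progress.
            stream.write(f"\rstep {index}/{len(steps)} done")
            stream.flush()
        stream.write("\n")
    finally:
        logger.removeHandler(handler)
        handler.close()
    return log_file.name


log_path = run_steps_with_progress([lambda: None, lambda: None])
```

The returned path lets the user open the full log after the run if something needs investigating.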
No due date
•6/17 issues closed
35% complete 11 open 6 closed
VDK Run Logs: Clean Error Handling
Error messages should clearly state the problem, without placeholder text or repeated log text, so that users can directly understand what went wrong. When an exception originates in user code and is passed up the call stack, users should be able to see the original exception so they can handle it. VDK developers should be discouraged from using generic error messages like "An error occurred". This will give more meaningful feedback to users.
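In Python terms, keeping the original exception visible usually means chaining it with `raise ... from`. The following is a minimal sketch of that pattern; `UserCodeError` and `run_user_step` are hypothetical names used for illustration, not VDK classes.

```python
class UserCodeError(Exception):
    """Hypothetical wrapper raised when a user's step fails."""


def run_user_step(step, step_name: str):
    try:
        return step()
    except Exception as error:
        # State what failed and why; `from error` keeps the original
        # exception available as __cause__ for the caller to handle.
        raise UserCodeError(
            f"Step '{step_name}' failed with {type(error).__name__}: {error}"
        ) from error


def failing_step():
    return {}["missing_column"]


try:
    run_user_step(failing_step, "10_load.py")
except UserCodeError as wrapped:
    original = wrapped.__cause__  # the KeyError raised in user code
```

The message names the step and the concrete exception type instead of a generic "An error occurred", and the traceback shows both exceptions.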
No due date
•17/17 issues closed
100% complete 0 open 17 closed
VDK Run Logs: Log Less
Reduce the amount of logging at the INFO level. Document per-job log level configuration options.
No due date
•2/2 issues closed
100% complete 0 open 2 closed
VDK Run Logs: Log Structure
Introduce structured logging to the VDK project. Give users the ability to configure how much metadata they see in their logs.
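As an illustration of structured logging with configurable metadata, the sketch below uses only the standard `logging` module to emit one JSON object per record, with the list of metadata fields chosen by the user. The `JsonFormatter` class and the logger name are assumptions for this example and are not the format VDK adopted.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record with only the chosen metadata fields."""

    def __init__(self, fields):
        super().__init__()
        self.fields = list(fields)

    def format(self, record: logging.LogRecord) -> str:
        payload = {"message": record.getMessage()}
        for field in self.fields:
            # Fields are standard LogRecord attributes such as levelname.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(fields=["levelname", "name"]))
logger = logging.getLogger("structured_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.info("ingested %d rows", 3)
```

Changing the `fields` list is all it takes to show more or less metadata, for example adding `"lineno"` or `"funcName"`.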
No due date
•25/26 issues closed
96% complete 1 open 25 closed
Python DB API databases support in VDK Discovery and POC
The end goal of this milestone is to have a POC and an enhancement proposal prepared for building comprehensive and easy-to-use support for almost any DB API (PEP 249) compatible database. Part of initiative https://github.com/vmware/versatile-data-kit/issues/2421. See also https://github.com/vmware/versatile-data-kit/issues/1444
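The appeal of PEP 249 is that code written against its `connect` / `cursor` / `execute` / `fetchall` interface works with any compliant driver. The sketch below shows a helper that relies only on that interface, demonstrated with the standard library's `sqlite3`; the helper name `fetch_all` is an assumption for illustration and is not part of VDK.

```python
import sqlite3


def fetch_all(connection, query: str, parameters=()):
    """Run a query using only PEP 249 calls (cursor, execute, fetchall)."""
    cursor = connection.cursor()
    try:
        cursor.execute(query, parameters)
        return cursor.fetchall()
    finally:
        cursor.close()


# sqlite3 is used here only because it ships with Python; any PEP 249
# compatible connection object could be passed to fetch_all instead.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE items (id INTEGER, name TEXT)")
cursor.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b")])
connection.commit()
cursor.close()

rows = fetch_all(connection, "SELECT id, name FROM items WHERE id > ?", (1,))
```

One caveat worth keeping in mind: the placeholder style (`?` here) is driver specific and is exposed through the module's `paramstyle` attribute, so generic support has to account for that difference.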
Overdue by 2 year(s)
Due by August 31, 2023
•4/4 issues closed
100% complete 0 open 4 closed
Getting started with your data and infrastructure: Discovery
VEP is written
Overdue by 2 year(s)
Due by August 23, 2023
•6/7 issues closed
85% complete 1 open 6 closed
Main Readme improvements v3
The goal is to implement follow-up improvements after release of v1 of the Main Readme refactoring designed at https://www.canva.com/design/DAFnq9ouriY/FxmviN90b9ofU5EvNV5HLQ/view
No due date
•1/4 issues closed
25% complete 3 open 1 closed
VDK Notebook Alpha - Tech debt
This is the leftover work from the VDK Notebooks - Alpha initiative that will be handled as part of VDK Notebooks - Beta. Included in VDK Initiative(s): #2419
No due date
•0/1 issues closed
0% complete 1 open 0 closed
VDK Notebook Alpha Documentation
VDK Notebook Alpha Documentation. Included in VDK Initiative(s): #2427
No due date
•2/5 issues closed
40% complete 3 open 2 closed
VDK Notebook Beta
No due date
•0/9 issues closed
0% complete 9 open 0 closed
Documentation (wiki) restructuring
Make it easier to navigate and find important information.

Problems identified:

- Documentation is not structured well enough to be intuitive: it is scattered and repetitive.
- It is difficult to navigate and difficult to find what you need.
- Documentation for different (use) cases is difficult to find.

Goals:

- Restructure the documentation in the Wiki. A new way of structuring the wiki must be created, agreed upon, and implemented.
- No new documentation needs to be written as part of this issue. Only refactoring of existing documentation is required. That may include moving, deleting, splitting up, or combining pages as deemed necessary.

Included in VDK Initiative(s): #2426
No due date
•4/11 issues closed
36% complete 7 open 4 closed
VDK: Bugfixes & Small Enhancements
VDK bucket initiative for small items (enhancements, bug fixes, or general work) that do not deserve their own epic.
Overdue by 2 year(s)
Due by December 31, 2023
•15/31 issues closed
48% complete 16 open 15 closed
Main Readme improvements
Users can see and understand what they can do with VDK from the main page. Included in VDK Initiative(s): #2426. Design: https://www.canva.com/design/DAFnq9ouriY/FxmviN90b9ofU5EvNV5HLQ/view
Overdue by 2 year(s)
Due by August 14, 2023
•10/10 issues closed
100% complete 0 open 10 closed
Documentation small enhancements and fixes
VDK bucket initiative for small items (enhancements, fixes, or general work) related to VDK documentation that do not deserve their own epic. Included in VDK Initiative(s): #2426
No due date
•4/18 issues closed
22% complete 14 open 4 closed
VDK Run Logs: Discovery
Discovery phase for the VDK Run Logs initiative. Included in VDK Initiative(s): #2408
No due date
•14/14 issues closed
100% complete 0 open 14 closed