Put feature proposal tickets here
No due date•0/1 issues closed
- No due date•7/13 issues closed
- No due date•22/27 issues closed
This should be completed in a separate branch. When every ticket in this milestone is complete and all edge cases and nuanced behaviour are understood, it can be merged into main.
No due date•0/7 issues closed
- No due date•6/6 issues closed
**What is the feature request? What problem does it solve?**

With the recent introduction of features supporting notebook integration in VDK (see [examples here](https://github.com/vmware/versatile-data-kit/wiki/Examples#jupyter-tutorials)), there is a significant opportunity to enhance user engagement and ease of use. The idea is to transition from local, Python file-based examples to interactive Google Colab notebook tutorials. This approach will not only simplify the initial setup (eliminating the need for local installation) but also provide a more comprehensive and guided learning experience through the use of markdown and images.

**Suggested solution**

- Transform current examples, which are predominantly local and Python-file based, into Google Colab notebooks.
- Utilize the notebook format to create step-by-step tutorials, integrating explanations, code, and visual aids for a more immersive learning experience.
- Ensure that these notebooks are easily accessible and runnable without requiring local environment setup, thus lowering the entry barrier for new users.
- Follow the recommended tutorial guidelines as outlined in [VMware VDK Tutorial Guidelines](https://github.com/vmware/versatile-data-kit/wiki/Tutorial-Guidelines) to maintain consistency and quality.

A good existing Google Colab-based tutorial to use as a template is https://bit.ly/vdk-ingest (though it is longer because it is used for workshops; examples should be shorter).

Good starting points for getting introduced to the Jupyter integrations are:

- https://colab.research.google.com/drive/16pBJQePbqkz3QFV54L4NIkOn1kwpuRrj
- https://github.com/vmware/versatile-data-kit/wiki/Create-a-Data-Job-through-the-Jupyter-UI
- https://github.com/vmware/versatile-data-kit/wiki/Develop-a-Data-Job-through-the-Jupyter-UI
- https://github.com/vmware/versatile-data-kit/wiki/Convert-Data-Job-to-Jupyter-Notebook
- also https://bit.ly/vdk-ingest
No due date•0/4 issues closed
- No due date
- No due date•3/4 issues closed
Last year Antoni did a proof of concept on generating datasets that can be used to train LLMs. The goal of this milestone is to take that idea and make it production ready. After this milestone is complete, we should have examples of creating LLM datasets from 2 different organizational data sources (e.g. GitLab and Confluence, or GitLab and Jira). We should have examples of saving LLM datasets to 2 different dataset storage repositories (e.g. the Hugging Face dataset registry and a persistent attached volume on k8s). We should also be able to show that it is somewhat configurable to meet customers' changing needs.
No due date•0/1 issues closed
**VEP**: https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-milestone-25-vector-database-ingestion

With the rise in popularity of LLMs and RAG, we see VDK as a core component for getting the data where we need it to be.

### Example problem scenario:

A company has a powerful private LLM chatbot. However, they want it to be able to answer questions using the latest version of Confluence docs, Jira tickets, etc. Retraining every night on the latest tickets/docs is not feasible. Instead, they opt to use RAG to improve the chatbot responses. This leaves them with the question: how do we populate the data?

Steps they need to complete:

1. Read data from Confluence/Jira.
2. Chunk into paragraphs (or something similar).
3. Embed into vector space.
4. Save the vector and paragraph in a vector database.
5. Remove old information. For example, if we are scraping Jira every hour and writing details to the vector database, we need to make sure we clean up all embeddings/chunks that were generated from old versions of the ticket.

### Our goal

We want to template this. We will build a data job in VDK that reads data from Confluence or Jira and writes it to a DSM Postgres instance with PGVector enabled. An embedding model will be running on a different machine and will be exposed through an API. We will make requests to the API to create embeddings for us. After this data job is running, we will create a template from it that we think customers will be able to adapt to their use cases.

### Proposed database table solution

| embedding | text chunk | document id |
| --- | --- | --- |
| [1,2,3,4,5,6] | in this document blah... | 15 |

### Prerequisite reading:

- Langchain (https://python.langchain.com/docs/expression_language/cookbook/retrieval)

#### Learning materials

- https://www.freecodecamp.org/news/vector-search-and-rag-tutorial-using-llms-with-your-data/
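Steps 2 to 5 above can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_embed` is a hypothetical stand-in for the remote embedding API (it is not a real model), `InMemoryVectorTable` stands in for the PGVector-backed table with the three proposed columns, and chunking on blank lines is just one possible strategy.

```python
from dataclasses import dataclass, field


def chunk_paragraphs(text: str) -> list[str]:
    """Step 2: split a document into paragraph chunks on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fake_embed(chunk: str, dims: int = 6) -> list[float]:
    """Step 3: hypothetical stand-in for the embedding API, NOT a real model."""
    vector = [0.0] * dims
    for i, ch in enumerate(chunk):
        vector[i % dims] += ord(ch) / 1000.0
    return vector


@dataclass
class InMemoryVectorTable:
    """Assumed stand-in for the (embedding, text chunk, document id) table."""

    rows: list[tuple[list[float], str, int]] = field(default_factory=list)

    def replace_document(self, document_id: int, text: str) -> None:
        # Step 5: drop every row produced from an older version of the document.
        self.rows = [row for row in self.rows if row[2] != document_id]
        # Steps 2-4: chunk, embed, and store the current version.
        for chunk in chunk_paragraphs(text):
            self.rows.append((fake_embed(chunk), chunk, document_id))


table = InMemoryVectorTable()
table.replace_document(15, "First paragraph.\n\nSecond paragraph.")
# A later scrape sees an edited ticket; the stale chunks must disappear.
table.replace_document(15, "Only paragraph after an edit.")
```

Deleting by document id before inserting keeps the cleanup step simple, at the cost of re-embedding unchanged chunks on every scrape.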
No due date•21/30 issues closed
- No due date•17/22 issues closed
Currently, if you want to configure VDK, you have to do it differently depending on what is installed.

- vdk-control-cli is configured using environment variables and the ~/.vdk/config file.
- vdk-core uses [ConfigurationBuilder](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-core/src/vdk/internal/core/config.py#L137) and the vdk_configure plugin to both define and set values. Values are set using the [job config ini or env variables plugin](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-core/src/vdk/internal/builtin_plugins/config/vdk_config.py#L147).
- Control Service accepts configuration through the vdk-options.ini helm chart option, which sets the correct environment variables.
- Some older options are provided in specific sections, while most options are in the [vdk] section of config.ini. See [example](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-control-cli/src/vdk/internal/control/job/sample_job/config.ini#L4).
- The current configuration providers (ini files) don't support data types (only string).

---

The solution should be to unify this configuration and provide consistent configuration management in an automated way. For example:

- Move the [vdk-config folder](https://github.com/vmware/versatile-data-kit/blob/2244f019f38391564600c0fb3101ca9d5bdef6f7/projects/vdk-control-cli/src/vdk/internal/control/configuration/vdk_config.py#L144) logic to a common place (likely vdk-core).
- Expand the logic to allow setting a global config file that can be overridden by more local ones (per team or per directory?). Perhaps it can look for a ~/.vdk-internal folder in each directory up to the home dir and take the first it finds, for each config option? (See [here](https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#provide-machine-level-global-settings).)
- Allow users to set "sensitive" VDK configuration (like impala password).
- Allow [vdk options in jupyter settings](https://jupyterlab.readthedocs.io/en/stable/user/directories.html#jupyterlab-user-settings-directory).
- Provide TOML or YAML configuration formats.
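The directory lookup floated above (walk up from the job directory towards the home dir and take the first configuration folder found) could look roughly like the sketch below. The function name `find_config_dir` and the temporary directory layout are assumptions made for illustration; this is not existing VDK code, and it resolves a single folder rather than merging values per config option.

```python
import tempfile
from pathlib import Path
from typing import Optional

CONFIG_DIR_NAME = ".vdk-internal"  # folder name taken from the proposal above


def find_config_dir(start: Path, stop: Path) -> Optional[Path]:
    """Return the first CONFIG_DIR_NAME folder found walking from start up to stop."""
    start, stop = start.resolve(), stop.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / CONFIG_DIR_NAME
        if candidate.is_dir():
            return candidate
        if directory == stop:
            break  # do not search above the "home" directory
    return None


# Demonstration in a temporary tree; "home" plays the role of the home directory.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp).resolve()
    job_dir = home / "team-a" / "my-job"
    job_dir.mkdir(parents=True)

    (home / CONFIG_DIR_NAME).mkdir()  # global settings
    found_global = find_config_dir(job_dir, home)

    (home / "team-a" / CONFIG_DIR_NAME).mkdir()  # more local, per-team settings
    found_team = find_config_dir(job_dir, home)
```

With only the global folder present the lookup returns it; once a per-team folder exists closer to the job, that one wins.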
No due date
When someone is new to a framework, the learning curve can often be steep. Offering a guided workflow can simplify this experience. By implementing an interactive CLI or notebook-based wizard that helps users set up their data jobs, source and destination configurations, etc., you can make it easier for new users to understand the framework's capabilities quickly. This can significantly reduce the time it takes to set up an initial PoC and evaluate VDK. See https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#guided-workflow--wizard-assistant-1
No due date•0/6 issues closed
As a data engineer using VDK, I want to define configuration using Python classes so that I can benefit from autocompletion and type checking.

- For each configuration class, there should be autocompletion, type checking, and tooltips in supported IDEs.
- Deprecated options should display warnings when used.
- The plugin developed for handling Python configurations should be compatible with VDK and should support all aforementioned features.

Non goal:

- Configuration files must be environment-specific and selectable via CLI.
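A minimal sketch of what class-based configuration with a deprecation warning might look like, using only the standard library. The class `IngestConfig` and its option names are hypothetical and chosen purely for illustration; they are not actual VDK options or the plugin's API.

```python
import warnings
from dataclasses import dataclass


@dataclass
class IngestConfig:
    """Hypothetical typed configuration; the option names are illustrative only."""

    target: str = "default-target"
    batch_size: int = 100

    @property
    def payload_size(self) -> int:
        """Deprecated alias for batch_size."""
        warnings.warn(
            "payload_size is deprecated; use batch_size instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.batch_size


# Typed fields give IDEs enough information for autocompletion and type checks.
config = IngestConfig(batch_size=500)

# Reading the deprecated option still works but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_value = config.payload_size
```

Docstrings on the class and its properties are what most IDEs surface as tooltips.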
No due date•0/3 issues closed
We have established certain guidelines for a good tutorial (or example) in https://github.com/vmware/versatile-data-kit/wiki/Tutorial-Guidelines. Most of the tutorials/examples were written before that, so they do not follow the guidelines very well. We should refactor them. This likely means splitting some tutorials, adding more visualizations, improving structure, and testing them with real users.
No due date
- No due date
- No due date
Introduce progress indicators to data jobs running locally. Redirect log output to temp file.
No due date•6/17 issues closed
Error messages should clearly state the problem, without placeholders or repeated log text, so that users can directly understand what went wrong. Users should be able to see the original exception when it is passed up the call stack and originates in user code, so they can handle it. VDK developers should be discouraged from using generic error messages like "An error occurred". This will give more meaningful feedback to users.
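One common way to keep the original exception visible while still adding a specific message is Python's exception chaining (`raise ... from ...`). The sketch below is illustrative only: `DataJobStepError`, `user_step`, and `run_step` are hypothetical names, not VDK classes or functions.

```python
class DataJobStepError(Exception):
    """Hypothetical error type used only for this illustration."""


def user_step(row: dict) -> float:
    # User code: raises KeyError when the expected column is missing.
    return row["amount"] * 1.2


def run_step(row: dict) -> float:
    try:
        return user_step(row)
    except KeyError as error:
        # State what failed and why, and chain the original exception with "from".
        raise DataJobStepError(
            f"Step 'user_step' failed: input row has no column {error}."
        ) from error


try:
    run_step({"price": 10})
except DataJobStepError as error:
    message = str(error)
    original = error.__cause__  # the user's KeyError is still reachable
```

The chained exception also shows up in the traceback, so users see both the specific message and where their own code failed.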
No due date•17/17 issues closed
Reduce the amount of logging at the INFO level. Document per-job log level configuration options.
No due date•2/2 issues closed
Introduce structured logging to the VDK project. Give users the ability to configure how much metadata they see in their logs.
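As a rough illustration of structured logging with configurable metadata, the sketch below uses only Python's standard `logging` module. `JsonFormatter` and its `metadata_fields` parameter are assumptions made for this example and do not describe the actual VDK implementation.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a configurable set of fields."""

    def __init__(self, metadata_fields: list[str]):
        super().__init__()
        self.metadata_fields = metadata_fields

    def format(self, record: logging.LogRecord) -> str:
        entry = {"message": record.getMessage()}
        for name in self.metadata_fields:
            # Standard LogRecord attributes, e.g. levelname, name, lineno.
            entry[name] = getattr(record, name, None)
        return json.dumps(entry)


# Write to an in-memory stream so the output can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(metadata_fields=["levelname", "name"]))

logger = logging.getLogger("structured_logging_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("ingested %d rows", 42)
logged = json.loads(stream.getvalue())
```

Changing the `metadata_fields` list is the only step needed to show more or less metadata per log line.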
No due date•25/26 issues closed
The end goal of this milestone is to have a prepared POC and an enhancement proposal for building comprehensive and easy-to-use support for almost any DB API (PEP 249) compatible database. Part of initiative https://github.com/vmware/versatile-data-kit/issues/2421 See also https://github.com/vmware/versatile-data-kit/issues/1444
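To illustrate why PEP 249 makes generic support plausible, the sketch below runs a query using only calls the specification defines (`cursor()`, `execute()`, `fetchall()`, `close()`). The helper `fetch_all` is a hypothetical name, and `sqlite3` is used only because it ships with Python; any compliant driver's connection could be passed instead.

```python
import sqlite3


def fetch_all(connection, query: str) -> list[tuple]:
    """Run a query using only calls defined by PEP 249."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


# sqlite3 serves as the example driver here.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE jobs (name TEXT, runs INTEGER)")
cursor.execute("INSERT INTO jobs VALUES ('ingest', 3)")
connection.commit()
cursor.close()

rows = fetch_all(connection, "SELECT name, runs FROM jobs")
connection.close()
```

Parameter placeholder styles differ between drivers (PEP 249's `paramstyle`), which is one of the details a generic layer would still have to handle.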
Overdue by 2 year(s)•Due by August 31, 2023•4/4 issues closed
VEP is written
Overdue by 2 year(s)•Due by August 23, 2023•6/7 issues closed
The goal is to implement follow-up improvements after the release of v1 of the Main Readme refactoring, designed at https://www.canva.com/design/DAFnq9ouriY/FxmviN90b9ofU5EvNV5HLQ/view
No due date•1/4 issues closed
This is the leftover work from the VDK Notebooks - Alpha initiative, which will be handled as part of VDK Notebooks - Beta. Included in VDK Initiative(s): #2419
No due date•0/1 issues closed
VDK Notebook Alpha Documentation. Included in VDK Initiative(s): #2427
No due date•2/5 issues closed
- No due date•0/9 issues closed
Make it easier to navigate and find important information.

Problems identified:

- Documentation is not structured well to be intuitive: scattered, repetitive.
- Difficult to navigate and difficult to find what you need.
- Documentation and different (use) cases are found difficult.

Goals

- Restructuring of documentation, in the Wiki. A new way of structuring the wiki must be created, agreed upon, and implemented.
- No new documentation needs to be written as part of this issue. Only refactoring of existing documentation is required. That may include moving, deleting, splitting up pages, or combining pages as deemed necessary.

Included in VDK Initiative(s): #2426
No due date•4/11 issues closed
VDK Bucket initiative for small items (enhancements, bug fixes or general work) that do not deserve their own epic
Overdue by 2 year(s)•Due by December 31, 2023•15/31 issues closed
Users can see and understand what they can do with VDK from the main page. Included in VDK Initiative(s): #2426 Design: https://www.canva.com/design/DAFnq9ouriY/FxmviN90b9ofU5EvNV5HLQ/view
Overdue by 2 year(s)•Due by August 14, 2023•10/10 issues closed
VDK Bucket initiative for small items (enhancements, fixes or general work) related to VDK Documentation that do not deserve their own epic. Included in VDK Initiative(s): #2426
No due date•4/18 issues closed
Discovery Phase for VDK Run Logs Initiative. Included in VDK Initiative(s): #2408
No due date•14/14 issues closed