Commits
Bump version to `1.4.0`
5 months ago
by gabrielmbmb

Temporary (using `pip`) fix for installing `llama-cpp-python` in CI (#886)
5 months ago
by gabrielmbmb

Fix unit tests after release of `transformers==4.44.0` (#891)
* Update unit tests so they work with `transformers>=4.44.0`
* fix more unit tests
5 months ago
by gabrielmbmb

Fix default structured output (#892)
* Add check for dependencies for structured outputs and change default value of structured outputs
* Update tests with serialized default structured output
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Send as many batches as possible to input queues (#895)
* Update `_manage_batch_flow` to send as many batches as can be built
* Fix load stages
* Fix unit test
* Fix `argilla` unit test after release `2.0.1`
* Can fail
5 months ago
by gabrielmbmb

Exclude `repo_id` from `LoadDataFromFileSystem` (#898)
* Exclude repo_id from LoadDataFromFileSystem generator class and update tests
* Update code to be compatible with python 3.9

Fix loader to read from a glob pattern (#877)
* Fix loader to read from a glob pattern
* Fix to read from general UPath instead of Path
* Update tests to use glob patterns
* Refactor to simplify check for glob pattern

Add `save_artifact` method to `_Step` (#871)
* Add `save_artifact` method
* Upload pipeline generated artifacts
* Fix log file was being saved in different cache
* Update `save_to_disk` to also save artifacts
* Render artifacts in card
* Update unit tests
* Add missing unit tests
* Update src/distilabel/distiset.py
Co-authored-by: Agus <agustin@argilla.io>
* Add section about saving artifacts
* Add correct `edit_uri`
---------
Co-authored-by: Agus <agustin@argilla.io>
5 months ago
by gabrielmbmb

Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input (#903)
* Add attribute to include raw formatted input to distilabel_metadata field
* Update tests to take into account add_raw_input attribute of tasks
* Add reference to add_raw_input in the documentation
* Update tests to control for the add_raw_input of the _Task

New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters (#902)
* Add new category for text manipulation and sort the dict alphabetically
* Redirect import
* Add new TruncateRow step to truncate the text using the number of characters or tokens
* Add tests for TruncateRow
* Update tokenizer name to avoid errors accessing the repo in CI
* Update src/distilabel/steps/__init__.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/truncate.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Refactor tokenizer_name to tokenizer for consistency
* Update test for the tokenizer refactor
* Refactor TruncateRow to TruncateTextColumn
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Update `inputs` and `outputs` interface to allow returning dict indicating optionality (#883)
* Use `CudaDevicePlacementMixin` in `RewardModelScore` step
* Add `StepColumns` type
* Update inputs and outputs validation
* Update type hints
* Update inputs checking
* Add unit test for checking inputs/outputs with dict
* Update type hints
* Update `inputs` and `outputs` return
* Add missing inputs and outputs in docstring
* Update docs
5 months ago
by gabrielmbmb

Update mistrallm (#904)
* Update mistralai client to version 1.*.*
* Update tests for new mistral client

Deepseek prover (#907)
* Add deepseek prover autoformalization task
* Add task for the scorer as a jinja template to make it easy to maintain
* Add deepseek prover scorer task
* Add tests for the scorer task
* Redirect import
* Create a folder for the deepseek-prover templates
* Make generator task more general including few shot examples
* Remove the few shot argument as we can determine by just checking for examples
* Remove deepseek-prover from the core as they are not that relevant for general pipelines
* Add deepseek prover pipeline
* Add entry for the paper implementation
* Remove tests
* Remove import
* Remove redirected import

Update `RewardModelScore.inputs` to define optional input columns (#908)
5 months ago
by gabrielmbmb

Add tutorial - generate data for training embeddings and reranking models (#893)
* Add initial outline tutorial
* Add section on data quality evaluation
* Add conclusion
* Update pipeline_samples structure for adding tutorials in a similar way as Argilla docs
* Update new structure tutorials
* Update title
* Update to use Free serverless Inference API
* Process comments from code review
* Remove sections from header
* Updated formatting examples
* Add grid arrow on new line
* update phrasing
* update phrasing
5 months ago
by davidberenstein1957

Fix load data from disk (#910)
* Fix repo_id in load and make config argument optional if possible
* Add tests for LoadFromDisk
* Update src/distilabel/steps/generators/huggingface.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Make error more informative
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

docs: minor fixes (#913)
* Fix minor error deepseek prover
* Fix minor type generate sentence pairs
5 months ago
by davidberenstein1957

Add `URIAL` task (#921)
* Initial work for `URIAL`
* Update template
* Fix checking last message
* Add `format_output` logic
* Refine `format_output` and add docstring
* Add `References`
* Add `URIAL` unit tests
4 months ago
by gabrielmbmb

Add `vLLMEmbeddings` (#920)
* Add vLLMEmbeddings to work with multiple GPUs
* Add mocked tests

docs: add tutorials preference and clean (#917)
* add tutorials
* clean dataset tutorial
* generate preference dataset tutorial
* modify sentence pairs tutorial
* add to index
* add missing component
* fix: first feedback
* fix: add headers
* fix: process for steps
* fix: typo and note
* add torch
* fix typo

Fix `StructuredGeneration` examples and internal check (#912)
* Fix error with instructor schema input
* Fix examples of structured generation
* Try inferring the type of format in case the user forgets informing about it

Generate deterministic pipeline name when it's not given (#878)
* Generate deterministic pipeline name when it's not given
* Use the names of the steps to generate the default pipeline name
* Update test with the steps names
* Add suggestion from code review

Add custom errors (#911)
* Module to store custom errors from distilabel
* Add tests for the new error types
* Refactor ValueError to DistilabelUserError and provide a page in the docs with further info
* Fix typo in docs
* Refactor ValueError to DistilabelUserError with reference page
* Add new error type for TypeErrors
* Add DistilabelTypeError for base _Step
* Add documentation section for step wrapper
* Update src/distilabel/errors.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Apply comments from code review
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Merge branch 'main' into develop
4 months ago
by gabrielmbmb

Docs/tutorials fix (#922)
* clean: duplicated header, installation, connect
* preference: installation, viewer, connect
* reranking: installation, tokenizer, connect
* example dataset sentence pairs
* workaround typing
* typo
* typo
* cookbook feedback

Add Plausible as replacement for GA (#929)
4 months ago
by davidberenstein1957

Add minhash related steps to deduplicate texts (#931)
* Initial work for minhash
* Add minhash step redirect
* Add first version of minhash and minhashlsh
* Add unit tests for minhash dedup
* Add pipeline testing deduplication
* Add tests to run with disk backend
* Add tests for the disk and ensure unload
* Add private _datasketch module to include a custom storage configuration for the minhash index
* Add docstrings to the internal classes/functions
* Add docstrings for the user facing classes
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update tests/integration/test_deduplication.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Add installation dependencies
* Apply comments from code review
* Add nltk as a dependency for the tests
* Update tests and interpretation of keep rows vs duplicates
* Remove disk backend from tests temporarily
* Add note in the docs related to minhash storage on disk
* Update tests to run on dict instead of disk as it never ends on CI
* Fix integration test
* Hide import inside of function to avoid installing it on docs building
* Update command to download nltk
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

docs: API reference review (#932)
* add headers labels API
* inherited members to false and ensure references from source
* ensure examples are rendered in components_gallery and API reference
* Remove space after Examples: in docstrings
* ensure citations are rendered
* note in CombineColumns and add missing steps
* fix \n usage in docstrings tasks, better \\n
* automatic LLM references for better maintenance
* fix distiset examples
* small fixes in galleries
* add embedding section and gallery
* add contributor docs
* fix typos and links
* Use HF Inference API instead of OpenAI in quickstart and README
* update extra steps
* add available models reference
* fix faiss-gpu dependency
* update extras
* add colab button and align welcome page as argilla

Refactor of MinHash to work with a single class and fix the shelve backend (#937)
* Initial work for minhash
* Add minhash step redirect
* Add first version of minhash and minhashlsh
* Add unit tests for minhash dedup
* Add pipeline testing deduplication
* Add tests to run with disk backend
* Add tests for the disk and ensure unload
* Add private _datasketch module to include a custom storage configuration for the minhash index
* Add docstrings to the internal classes/functions
* Add docstrings for the user facing classes
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update tests/integration/test_deduplication.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Add installation dependencies
* Apply comments from code review
* Add nltk as a dependency for the tests
* Update tests and interpretation of keep rows vs duplicates
* Remove disk backend from tests temporarily
* Add note in the docs related to minhash storage on disk
* Update tests to run on dict instead of disk as it never ends on CI
* Fix integration test
* Hide import inside of function to avoid installing it on docs building
* Update command to download nltk
* Allow for a name in the shelve based backend to avoid overwrites
* Refactor MinHash to use a single MinHashDedup class that controls all the process
* Refactor tests to use the new class
* Redirect import to steps level
* Create new disk based storage using diskcache
* Add docstrings to clarify the difference between dict/disk
* Refactor to use diskcache
* Fix docstring example
* Update src/distilabel/steps/filtering/minhash.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update definition of the step
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 (#936)
* Update `make_generator_step`
* Update unit tests
4 months ago
by gabrielmbmb

Offline batch generation (#923)
* Initial work for `offline_batch_generate`
* Add code for uploading Batch API files to OpenAI
* Add `offline_batch_inference` attribute
* `offline_batch_generate` finished for `OpenAILLM`
* Add attributes for checking task compatibility with `offline_batch_generation`
* Extend `is_global` property for `offline_batch_generation`
* Move `job_ids` responsibility to `LLM`
* And remember... `unload` everything before pickling
* Move `BASE_CACHE_DIR` to constants
* Recover for offline batch generation
* Polling sleep
* Store input data for recovering offline batch generation
* Lint
* Add checking no offline batch generation with `RayPipeline`
* Update `LLM`s unit tests
* Update `Task`s unit tests
* Add `OpenAILLM.offline_batch_generate` unit tests
* Fix unit test
* Add unit tests for adding recovery batch for offline generation
* Update tasks that can be used with offline batch generation
* Move aux functions to utils
* Handle `_SecretField` and excluded attributes when refreshing pipeline from cache
* Fix checking inner type
* Add simple integration test
* Remove unit test
* Fix formatting exception
* Update type hint
* Handle stopping offline batch generation polling
* Use `_stop_called_lock` everywhere
* Fix deadlock
* Fix load
* Add Batch API example
* Update examples
* How to offline batch generation
* Add FAQ about OpenAI Batch API
* Update links
* Add `envs` module
* Add setting pipeline running env variables in child process
* Update OpenAI file upload to assign custom name
* Download nltk every time
* Add missing arguments
* Update logging message
* Add section about offline batch generation
* Add errors and exceptions API docs
* Fix unit test
* Update mkdocs.yaml
4 months ago
by gabrielmbmb

Fix applying input mapping when mapping overrides another column (#938)
* Fix inputs rows overridden
* Add unit test
* Fix applying input mappings
* Fix `overriden_inputs`
* Fix unit test
4 months ago
by gabrielmbmb

Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` (#941)
* Fix CUDA device placement with multiple replicas
* Print replica id
* Copy `step` for each replica
4 months ago
by gabrielmbmb

Fix empty load stage when two `GlobalStep`s are chained (#945)
4 months ago
by gabrielmbmb

Update `TextGeneration` to deprecate `use_system_prompt` and add (#950)
`system_prompt` attribute
4 months ago
by gabrielmbmb

Add step to deduplicate records based on embeddings (#946)
* Redirect import
* Add train_size argument to allow training indices
* Fix error when retrieving info from a dataset fails creating a step from the make_generator_step helper
* Add embedding dedup step
* Add unit and integration tests for embedding dedup
* Apply comments from code review

Updated `setup_logging` to use UTF-8 encoding in `FileHandler` (#952)
* Updated setup_logging to use UTF-8 in FileHandler
* Update src/distilabel/utils/logging.py
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Add more generation parameters to `vLLM` (#955)
4 months ago
by gabrielmbmb

Fix `Magpie` generating different columns depending on `LLM` output (#965)
4 months ago
by gabrielmbmb

Docs/962 docs create a smoother transition from index installation quickstart (#968)
* docs: swapped order of installation and quickstart
* chore: remove pipeline png file
4 months ago
by davidberenstein1957

[DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints (#973)
* Make the regular expression more general to capture extra characters in the headers of the examples' sections
* Add tip in the example title for free inference endpoints
* Add FAQ entry for input batch size in serverless endpoints
* Change default LLM
* Update docs/sections/getting_started/faq.md
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Fix wording for SEO purposes
* Add installing libdnnl-dev
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

[FEATURE] Simplify customizing the `TextGeneration` task with custom prompts (#974)
* Simplify customization of TextGeneration
* Update tests loading the task
* Extra tests for the new functionality
* Added examples and extra checks
* Include missing attributes and info in docstrings
* Fix model_post_init call to super
* Force a template for the task
* Trying to fix the pickling error
* It's unused, but the argument of generate was wrongly spelled
* Checking if works without an instance of Template
* Remove template in unload to fix error on offline batch generation

Update `system_prompt` attribute for adding probabilities in `MagpieBase` (#981)
* Update `system_prompt` so it can be a `Dict[str, Any]`
* Update docstrings and tests
* Update tests
4 months ago
by gabrielmbmb

Send as many `None`s as replicas in the step (#982)
4 months ago
by gabrielmbmb

docs: 960 docs add a glossary concept section (#970)
* feat: add _STEP_CATEGORY_TO_DESCRIPTION
* docs: emphasize structured generation
* docs: add pipeline visualizations
* docs: add concepts page outline
* Apply suggestions from code review
Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>
* docs: processed comments Natalia
* chore: add tabulate dependency
* feat: add task overview
* docs: remove task overview from API definition
* docs: update naming tutorials
* docs: update naming
---------
Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>
4 months ago
by davidberenstein1957

Fix missing `system_prompt_key` column in `Magpie` tasks (#983)
* Fix missing `system_prompt_key` column
* Fix wrong `system_prompt_key` associated to conversation
* Update unit tests
4 months ago
by gabrielmbmb

docs: update component gallery (#987)
4 months ago
by davidberenstein1957

docs: update install overview in readme
4 months ago
by davidberenstein1957

docs: update installation overview
4 months ago
by davidberenstein1957

Fix missing batch when last batch arrive early (#989)

Fine personas socialai tutorial (#992)
* Remove pdm things
* Draft of socialai example
* Add example/post for socialai/fine personas
* Simplify title per code review

feat: add basic draw implementation to pipeline (#966)
* feat: add basic draw implementation to pipeline
* refactor: cleanup some code
* feat: add functionality to draw TD or LR
* refactor: remove step name from vis
* refactor: default to LR generation
* Add dag with mapping
* feat: add edge labels
* Remove images
* feat: add support for leaf node to argilla and distilabel
* refactor: order of functions
* test: Add tests
* fix: replace logger warning for `warning.warn` to avoid non-initialized logger
* fix: avoid potentially getting raised errors during `get_outputs` call relying on dynamic calls
* docs: Add visualizing pipelines section
* feat: Add a try-except around pipeline visualization in Notebook to ensure it will never be a blocking action
* feat: add a show method to the pipelines for visualizing in notebooks
* docs: add more context on pipeline.show
* Apply suggestions from code review
Co-authored-by: Agus <agustin@argilla.io>
* Update src/distilabel/steps/generators/huggingface.py
* feat: remove show to simplify flow
* refactor: mermaid URL at top as constant
* feat: improve flow for passing by info to a potential next step
* docs: update docstring
---------
Co-authored-by: Agus <agustin@argilla.io>
4 months ago
by davidberenstein1957

Fix schema inference structured generation (#994)
* fix: converting ModelMetaClass to model_json_schema
* fix: allow for adding optional literal format json to instructor to make methods more interchangeable
* docs: emphasize usability with any framework
* fix: first check if structured_output has been defined
* Update docs/sections/how_to_guides/advanced/structured_generation.md
Co-authored-by: Agus <agustin@argilla.io>
---------
Co-authored-by: Agus <agustin@argilla.io>
3 months ago
by davidberenstein1957

[DOCS] Add developer documentation section in the docs (#999)
* Add new section with developer docs
* Fix name of link
* Add help for PR body

Fix `vllm` installation in CI (#1009)
3 months ago
by gabrielmbmb

Fix writing `distilabel_metadata` column when `LLM` error (#1003)
* fix metadata writeout when llm error
* linter reformat
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Add example of custom text generation step in quickstart (#984)

Fix `llvmlite` install with `uv` (#1018)
* Add `numba >= 0.54.0`
* Use `numpy < 2.0.0`
* Install vLLM first
* remove llm blender install
3 months ago
by gabrielmbmb

tests: validate passing questions and field within format_input too (#1017)
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
3 months ago
by davidberenstein1957

Fix impute when `output_mapping` is not empty (#1015)

Add Tasks to replicate `APIGen` (#925)
* Add apigen task module
* Add tests for apigen
* Fix default name for dataset info when requesting the number of examples
* checkpoint
* Add tests for apigen generator
* Create jinja template, split methods and add docstrings
* Update string format
* Simplify function setting and move it to load method
* Add tests for semantic checker
* Add prompt template for semantic checker
* Redirect import for semantic checker
* Fix docstrings for output columns
* Add semantic checker task from apigen
* Add notes for execution checker
* Remove extra jump of line
* Add first version of data sampler, step helper for apigen
* Add tests for data sampler
* Add integration test to check the sampler can be mixed with another generator step
* Draft tests for new execution checker
* Move helper functions
* Draft for execution checker functionality
* Add first version of execution checker and tests
* Add tests for utils module of apigen
* Remove unnecessary step for transformation and rename files for clarity
* Fix import
* Change function results name to show the original results from the execution
* Remove print when the url for a reference doesn't contain https://arxiv
* first working version
* Fix tests including previous columns
* Go back to previous name for dummy llm
* Change dummy llm names on tests
* Read the answers from the model parsed instead of dumped string
* Add option to include the tools if available for few shot
* Allow extra checks for the parameter types and tests for those
* Add docs for the execution checker
* Add new icon for execution
* Fix return type for outputs column
* Fix docstrings
* Redirect imports to top level
* Update docstrings to render on components gallery
* Improve docstrings for fields in the data sampler
* Remove unnecessary data from docstrings and remove TODO
* Add missing data variable in example
* Update src/distilabel/steps/tasks/apigen/execution_checker.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Refactor to return formatted json string instead of dict to simplify work with arrow
* Draft tutorial to replicate paper
* Allow number to be a dict with values and probabilities
* Update pipeline run call
* Add functionality to load functions from a folder with .py files
* Fix comment for arg
* Add example implementation
* Add dependency for vllm
* Fix dependency name
* Add setuptools-scm in the script with the dependencies to install it prior to vllm
* Another attempt with system
* Add tests to take into account casting methods
* Avoid casting and update prompt to ensure argument order is respected
* Inform error type on generator
* Add extra checks and safeguards for failed answer generation
* Ensure the error is of the expected type
* Fix unstructured generation
* Remove json fences and fix semantic checker
* Control case of functions without arguments
* Add additional checks to run the execution checker
* Remove additional dependency
* Try fixing CI error with dependencies
* Install dependency for the system
* Undo fix attempt
* Try fixing llvmlite dependency issue
* Remove additional dependency as it breaks other tests
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Pretty print (#934)
* Add integration test to showcase the prompts
* Add a base print method so the Tasks can pretty print their prompts easily
* Update base method to allow automatic pretty printing
* Add optional argument instead of the default, update method name and return type
* Fix type hint
* Add example in docstrings
* Add section in docs for the print method

Add `CLAIR` task (#926)
* Redirect import of CLAIR
* Add jinja2 template for CLAIR
* Add CLAIR task
* Add tests for CLAIR task
* Update example in docstrings
* Add tutorial to reproduce CLAIR
* Show new tutorial in the gallery and fix rendering issue in docstrings

Add cache at `Step` level (#766)
* Add signature method for Serializable objects
* Update signature to only keep track of the step names and not its internal info
* Refactor hash generation
* Add dummy batch manager from dag
* Update batch manager cache tests to start batch manager from a DAG
* Draft of integration tests for new caching
* Checkpoint draft
* Add cache directory location
* Add use_cache argument to Step for future use
* Change output names to keep track of them while debugging
* Make use of use_cache at the step level
* Add docstrings for internal batch manager arguments
* Remove path from add_batch method
* Move step caching to get_batch method in batch manager step
* Read batches from cached dir
* Set every step cache to False if the pipeline has the cache as False
* Comment for the batch manager
* Move back to caching from add_step
* Checkpoint current status
* Add use_cache on step
* If there's previous data saved, concatenate the content of the parquet files
* Only read the distiset from cache if all the steps are the same, otherwise overwrite
* Add changes to make loading a new and modified step feasible
* Set use cache to True by default
* Move logic of registering the batches to BasePipeline._register_batch to do it before calling _manage_batch_flows
* Avoid reading parquet file from cache when any of the steps has use_cache=False
* Add is_convergence method to DAG and cleanup batch_manager
* Add integration tests for the new caching mechanism
* Update unit tests related to register_batch
* Fix signature serialization case of void list
* Add use_cache to argilla tests
* Fix tests related to use_cache
* Fix tests
* Remove undefined object input
* Add `_invalidate_steps_cache_if_required` method
* Initial work for loading batches from `batch_manager_data` directory
* Draft cache updates
* Update pipeline signature
* Add signature mixin from other PR
* Moved pipeline cache to executions folder with different data per pipeline
* Testing new updates to read from cache
* Checkpoint with loading working while adding new steps
* Point of control
* Fix not all the batches were being saved
* Sort batches after loaded
* Fix `load_from_cache` to load batches from `steps_data` directory correctly
* Update test
* Add `step_has_finished` method
* Update invalidate cache function
* Update integration caching tests
* Refactor to extract logic to methods
* Refactor to remove `cached_data_dir`
* Update stages message
* Refactor `invalidate_cache_for` method
* Fix `_BatchManager` unit tests
* Update to not serialize `exclude_from_signature` attribute
* Fix pipeline unit tests
* Remove write buffer data if `use_cache=False`
* Fix offline batch generation attributes were being not ignored by signature
* Fix print test
* Fix routing batch function
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Fix `IndexError` when overriding inputs and `group_generations=False` (#1022)
* Fix processing num_generations when applying input mappings in steps process
* Add unit test
* Update comment
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Update `Pipeline cache` docs (#1023)
* Update link
* Update cache section
* Add step to fail if warnings
* Fix dependency name
3 months ago
by gabrielmbmb

Fix cross-reference
3 months ago
by gabrielmbmb

Bump version to `1.5.0`
3 months ago
by gabrielmbmb

Add common typing module (#1029)
* Add common typing module
* Update src/distilabel/typing.py
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
---------
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Update `docs` workflows to use `uv` (#1032)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 months ago
by gabrielmbmb

Merge branch 'main' into develop
3 months ago
by gabrielmbmb

fix: simplify prompt template `ArgillaLabeller` (#1033)
3 months ago
by davidberenstein1957

Add `dataset_batch_size` argument (#1039)
3 months ago
by gabrielmbmb

Move all LLMs to distilabel.models (#1045)

Fix a tiny typo in `_Step` docstring (#1051)
Co-authored-by: plaguss <agustin@argilla.io>
2 months ago
by sadra-barikbin

docs: improve docs for `MinHashDedup` `Step` (#1050)
Co-authored-by: plaguss <agustin@argilla.io>

Fix new response_format variable in openai api (#1053)

[pre-commit.ci] pre-commit autoupdate (#1043)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2 months ago
by pre-commit-ci[bot]

Update `LLM.generate` output to include `statistics` (#1034)
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Add example of structured output. (#1061)
Co-authored-by: burtenshaw <ben@argilla.io>

hotfix: import errors in llms
2 months ago
by burtenshaw

Fix `StepOutput` type (#1072)

docs: update issue templates (#1074)

Update `unload` method from `vLLM` to properly free resources (#1077)
1 month ago
by gabrielmbmb

Add tasks to replicate Math-shepherd (#1052)
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

Add `load_groups` argument to `run` (#1075)
Co-authored-by: Agus <agustin@argilla.io>
1 month ago
by gabrielmbmb

Add TextGenerationWithImage task (#1066)
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Create columns with `LLM` returned extra keys (#1078)
22 days ago
by gabrielmbmb

Enable `RUF022` to automatically sort `__all__`
22 days ago
by gabrielmbmb

Fix `vLLM` unload logic when model is `None` (#1080)
17 days ago
by gabrielmbmb

Fix `merge_distilabel_metadata` function when handling outputs from `Task` with `group_generations==True` (#1082)
16 days ago
by gabrielmbmb

Merge branch 'main' into develop
16 days ago
by gabrielmbmb