argilla-io
distilabel

`1.4.0`

#1024 Merged
Comparing develop (f1f7d77) with develop (925d259)
CodSpeed Performance Gauge: -1%
Improvements: 0
Regressions: 0
Untouched: 1
New: 0
Dropped: 0
Ignored: 0

Benchmarks

Passed

tests/integration/test_cache.py::test_cache_time
CodSpeed Performance Gauge: -1%
550.5 ms
555.3 ms

Commits

Base: develop (925d259)
-3%
Bump version to `1.4.0`
ecbe16b
5 months ago
by gabrielmbmb
+2%
Temporary (using `pip`) fix for installing `llama-cpp-python` in CI (#886)
1198d24
5 months ago
by gabrielmbmb
+4%
Fix unit tests after release of `transformers==4.44.0` (#891) * Update unit tests so they work with `transformers>=4.44.0` * fix more unit tests
8916ff2
5 months ago
by gabrielmbmb
-2%
Fix default structured output (#892) * Add check for dependencies for structured outputs and change default value of structured outputs * Update tests with serialized default structured output --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
75baf64
5 months ago
by plaguss
-1%
Send as many batches as possible to input queues (#895) * Update `_manage_batch_flow` to send as many batches as can be built * Fix load stages * Fix unit test * Fix `argilla` unit test after release `2.0.1` * Can fail
7ff4d20
5 months ago
by gabrielmbmb
-2%
Exclude `repo_id` from `LoadDataFromFileSystem` (#898) * Exclude repo_id from LoadDataFromFileSystem generator class and update tests * Update code to be compatible with python 3.9
04d0bf0
5 months ago
by plaguss
+3%
Fix loader to read from a glob pattern (#877) * Fix loader to read from a glob pattern * Fix to read from general UPath instead of Path * Update tests to use glob patterns * Refactor to simplify check for glob pattern
f382f1c
5 months ago
by plaguss
-1%
Add `save_artifact` method to `_Step` (#871) * Add `save_artifact` method * Upload pipeline generated artifacts * Fix log file was being saved in different cache * Update `save_to_disk` to also save artifacts * Render artifacts in card * Update unit tests * Add missing unit tests * Update src/distilabel/distiset.py Co-authored-by: Agus <agustin@argilla.io> * Add section about saving artifacts * Add correct `edit_uri` --------- Co-authored-by: Agus <agustin@argilla.io>
c8df5a9
5 months ago
by gabrielmbmb
0%
Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input (#903) * Add attribute to include raw formatted input to distilabel_metadata field * Update tests to take into account add_raw_input attribute of tasks * Add reference to add_raw_input in the documentation * Update tests to control for the add_raw_input of the _Task
3d772c5
5 months ago
by plaguss
-4%
New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters (#902) * Add new category for text manipulation and sort the dict alphabetically * Redirect import * Add new TruncateRow step to truncate the text using the number of characters or tokens * Add tests for TruncateRow * Update tokenizer name to avoid errors accessing the repo in CI * Update src/distilabel/steps/__init__.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/truncate.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Refactor tokenizer_name to tokenizer for consistency * Update test for the tokenizer refactor * Refactor TruncateRow to TruncateTextColumn --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
4740063
5 months ago
by plaguss
+6%
Update `inputs` and `outputs` interface to allow returning dict indicating optionality (#883) * Use `CudaDevicePlacementMixin` in `RewardModelScore` step * Add `StepColumns` type * Update inputs and outputs validation * Update type hints * Update inputs checking * Add unit test for checking inputs/outputs with dict * Update type hints * Update `inputs` and `outputs` return * Add missing inputs and outputs in docstring * Update docs
4093699
5 months ago
by gabrielmbmb
-1%
Update mistrallm (#904) * Update mistralai client to version 1.*.* * Update tests for new mistral client
ed874ba
5 months ago
by plaguss
-2%
Deepseek prover (#907) * Add deepseek prover autoformalization task * Add task for the scorer as a jinja template to make it easy to maintain * Add deepseek prover scorer task * Add tests for the scorer task * Redirect import * Create a folder for the deepseek-prover templates * Make generator task more general including few shot examples * Remove the few shot argument as we can determine by just checking for examples * Remove deepseek-prover from the core as they are not that relevant for general pipelines * Add deepseek prover pipeline * Add entry for the paper implementation * Remove tests * Remove import * Remove redirected import
10fff29
5 months ago
by plaguss
+3%
Update `RewardModelScore.inputs` to define optional input columns (#908)
974f0db
5 months ago
by gabrielmbmb
-3%
Add tutorial - generate data for training embeddings and reranking models (#893) * Add initial outline tutorial * Add section on data quality evaluation * Add conclusion * Update pipeline_samples structure for adding tutorials in a similar way as Argilla docs * Update new structure tutorials * Update title * Update to use Free serverless Inference API * Process comments from code review * Remove sections from header * Updated formatting examples * Add grid arrow on new line * update phrasing * update phrasing
3264563
5 months ago
by davidberenstein1957
+3%
Fix load data from disk (#910) * Fix repo_id in load and make config argument optional if possible * Add tests for LoadFromDisk * Update src/distilabel/steps/generators/huggingface.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Make error more informative --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
ebe7e25
5 months ago
by plaguss
-2%
docs: minor fixes (#913) * Fix minor error deepseek prover * Fix minor type generate sentence pairs
516909e
5 months ago
by davidberenstein1957
0%
Add `URIAL` task (#921) * Initial work for `URIAL` * Update template * Fix checking last message * Add `format_output` logic * Refine `format_output` and add docstring * Add `References` * Add `URIAL` unit tests
2a3906d
4 months ago
by gabrielmbmb
0%
Add `vLLMEmbeddings` (#920) * Add vLLMEmbeddings to work with multiple GPUs * Add mocked tests
a796a75
4 months ago
by plaguss
0%
docs: add tutorials preference and clean (#917) * add tutorials * clean dataset tutorial * generate preference dataset tutorial * modify sentence pairs tutorial * add to index * add missing component * fix: first feedback * fix: add headers * fix: process for steps * fix: typo and note * add torch * fix typo
46d55ed
4 months ago
by sdiazlor
-1%
Fix `StructuredGeneration` examples and internal check (#912) * Fix error with instructor schema input * Fix examples of structured generation * Try inferring the type of format in case the user forgets informing about it
6576d1a
4 months ago
by plaguss
-4%
Generate deterministic pipeline name when it's not given (#878) * Generate deterministic pipeline name when it's not given * Use the names of the steps to generate the default pipeline name * Update test with the steps names * Add suggestion from code review
fc5d070
4 months ago
by plaguss
+12%
Add custom errors (#911) * Module to store custom errors from distilabel * Add tests for the new error types * Refactor ValueError to DistilabelUserError and provide a page in the docs with further info * Fix typo in docs * Refactor ValueError to DistilabelUserError with reference page * Add new error type for TypeErrors * Add DistilabelTypeError for base _Step * Add documentation section for step wrapper * Update src/distilabel/errors.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Apply comments from code review --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
22db32c
4 months ago
by plaguss
-7%
Merge branch 'main' into develop
def7060
4 months ago
by gabrielmbmb
+1%
Docs/tutorials fix (#922) * clean: duplicated header, installation, connect * preference: installation, viewer, connect * reranking: installation, tokenizer, connect * example dataset sentence pairs * workaround typing * typo * typo * cookbook feedback
af3515a
4 months ago
by sdiazlor
0%
Add Plausible as replacement for GA (#929)
2ce44f0
4 months ago
by davidberenstein1957
-2%
Add minhash related steps to deduplicate texts (#931) * Initial work for minhash * Add minhash step redirect * Add first version of minhash and minhashlsh * Add unit tests for minhash dedup * Add pipeline testing deduplication * Add tests to run with disk backend * Add tests for the disk and ensure unload * Add private _datasketch module to include a custom storage configuration for the minhash index * Add docstrings to the internal classes/functions * Add docstrings for the user facing classes * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update tests/integration/test_deduplication.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Add installation dependencies * Apply comments from code review * Add nltk as a dependency for the tests * Update tests and interpretation of keep rows vs duplicates * Remove disk backend from tests temporarily * Add note in the docs related to minhash storage on disk * Update tests to run on dict instead of disk as it never ends on CI * Fix integration test * Hide import inside of function to avoid installing it on docs building * Update command to download nltk --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
bb14e8b
4 months ago
by plaguss
+3%
docs: API reference review (#932) * add headers labels API * inherited members to false and ensure references from source * ensure examples are rendered in components_gallery and API reference * Remove space after Examples: in docstrings * ensure citations are rendered * note in CombineColumns and add missing steps * fix \n usage in docstrings tasks, better \\n * automatic LLM references for better maintenance * fix distiset examples * small fixes in galleries * add embedding section and gallery * add contributor docs * fix typos and links * Use HF Inference API instead of OpenAI in quickstart and README * update extra steps * add available models reference * fix faiss-gpu dependency * update extras * add colab button and align welcome page as argilla
88615c7
4 months ago
by sdiazlor
-2%
Refactor of MinHash to work with a single class and fix the shelve backend (#937) * Initial work for minhash * Add minhash step redirect * Add first version of minhash and minhashlsh * Add unit tests for minhash dedup * Add pipeline testing deduplication * Add tests to run with disk backend * Add tests for the disk and ensure unload * Add private _datasketch module to include a custom storage configuration for the minhash index * Add docstrings to the internal classes/functions * Add docstrings for the user facing classes * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update tests/integration/test_deduplication.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Add installation dependencies * Apply comments from code review * Add nltk as a dependency for the tests * Update tests and interpretation of keep rows vs duplicates * Remove disk backend from tests temporarily * Add note in the docs related to minhash storage on disk * Update tests to run on dict instead of disk as it never ends on CI * Fix integration test * Hide import inside of function to avoid installing it on docs building * Update command to download nltk * Allow for a name in the shelve based backend to avoid overwrites * Refactor MinHash to use a single MinHashDedup class that controls all the process * Refactor tests to use the new class * Redirect import to steps level * Create new disk based storage using diskcache * Add docstrings to clarify the difference between dict/disk * Refactor to use diskcache * Fix docstring example * Update 
src/distilabel/steps/filtering/minhash.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Update definition of the step --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
4b3c9c0
4 months ago
by plaguss
-1%
Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 (#936) * Update `make_generator_step` * Update unit tests
4556135
4 months ago
by gabrielmbmb
+3%
update regex (#940)
a2a8e86
4 months ago
by sdiazlor
-13%
Offline batch generation (#923) * Initial work for `offline_batch_generate` * Add code for uploading Batch API files to OpenAI * Add `offline_batch_inference` attribute * `offline_batch_generate` finished for `OpenAILLM` * Add attributes for checking task compatibility with `offline_batch_generation` * Extend `is_global` property for `offline_batch_generation` * Move `job_ids` responsability to `LLM` * And remember... `unload` everything before pickling * Move `BASE_CACHE_DIR` to constants * Recover for offline batch generation * Polling sleep * Store input data for recovering offline batch generation * Lint * Add checking no offline batch generation with `RayPipeline` * Update `LLM`s unit tests * Update `Task`s unit tests * Add `OpenAILLM.offline_batch_generate` unit tests * Fix unit test * Add unit tests for adding recovery batch for offline generation * Update tasks that can be used with offline batch generation * Move aux functions to utils * Handle `_SecretField` and excluded attributes when refreshing pipeline from cache * Fix checking inner type * Add simple integration test * Remove unit test * Fix formatting exception * Update type hint * Handle stopping offline batch generation polling * Use `_stop_called_lock` everywhere * Fix deadlock * Fix load * Add Batch API example * Update examples * How to offline batch generation * Add FAQ about OpenAI Batch API * Update links * Add `envs` module * Add setting pipeline running env variables in child process * Update OpenAI file upload to assign custom name * Download nltk everytime * Add missing arguments * Update logging message * Add section about offline batch generation * Add errors and exceptions API docs * Fix unit test * Update mkdocs.yaml
28485d0
4 months ago
by gabrielmbmb
+13%
Fix applying input mapping when mapping overrides another column (#938) * Fix inputs rows overridden * Add unit test * Fix applying input mappings * Fix `overriden_inputs` * Fix unit test
c8f4d61
4 months ago
by gabrielmbmb
+1%
Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` (#941) * Fix CUDA device placement with multiple replicas * Print replica id * Copy `step` for each replica
56b4036
4 months ago
by gabrielmbmb
-5%
Fix empty load stage when two `GlobalStep`s are chained (#945)
ebd2bb7
4 months ago
by gabrielmbmb
+6%
Update `TextGeneration` to deprecate `use_system_prompt` and add `system_prompt` attribute (#950)
973e0fa
4 months ago
by gabrielmbmb
-7%
Add step to deduplicate records based on embeddings (#946) * Redirect import * Add train_size argument to allow training indices * Fix error when retrieving info from a dataset fails creating a step from the make_generator_step helper * Add embedding dedup step * Add unit and integration tests for embedding dedup * Apply comments from code review
de2bed0
4 months ago
by plaguss
+3%
Updated `setup_logging` to use UTF-8 encoding in `FileHandler` (#952) * Updated setup_logging to use UTF-8 in FileHandler * Update src/distilabel/utils/logging.py --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
eef8961
4 months ago
by dameikle
+2%
Add more generation parameters to `vLLM` (#955)
8e9cc8d
4 months ago
by gabrielmbmb
-1%
Fix `Magpie` generating different columns depending on `LLM` output (#965)
f207fab
4 months ago
by gabrielmbmb
-1%
Docs/962 docs create a smoother transition from index installation quickstart (#968) * docs: swapped order of installation and quickstart * chore: remove pipeline png file
6e2c9b1
4 months ago
by davidberenstein1957
+2%
[DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints (#973) * Make the regular expression more general to capture extra characters in the headers of the examples' sections * Add tip in the example title for free inference endpoints * Add FAQ entry for input batch size in serverless endpoints * Change default LLM * Update docs/sections/getting_started/faq.md Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> * Fix wording for SEO purposes * Add installing libdnnl-dev --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
28ecbc4
4 months ago
by plaguss
-1%
[FEATURE] Simplify customizing the `TextGeneration` task with custom prompts (#974) * Simplify customization of TextGeneration * Update tests loading the task * Extra tests for the new functionality * Added examples and extra checks * Include missing attributes and info in docstrings * Fix model_post_init call to super * Force a template for the task * Trying to fix the pickling error * It's unused, but the argument of generate was wrongly spelled * Checking if works without an instance of Template * Remove template in unload to fix error on offline batch generation
af08b59
4 months ago
by plaguss
0%
Update `system_prompt` attribute for adding probabilities in `MagpieBase` (#981) * Update `system_prompt` so it can be a `Dict[str, Any]` * Update docstrings and tests * Update tests
e1253a6
4 months ago
by gabrielmbmb
-1%
Send as many `None`s as replicas in the step (#982)
75e34e1
4 months ago
by gabrielmbmb
-4%
docs: 960 docs add a glossary concept section (#970) * feat: add _STEP_CATEGORY_TO_DESCRIPTION * docs: emphasize structured generation * docs: add pipeline visualizations * docs: add concepts page outline * Apply suggestions from code review Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com> * docs: processed comments Natalia * chore: add tabulate dependency * feat: add task overview * docs: remove task overview from API definition * docs: update naming tutorials * docs: update naming --------- Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>
b2d8eb5
4 months ago
by davidberenstein1957
+8%
Fix missing `system_prompt_key` column in `Magpie` tasks (#983) * Fix missing `system_prompt_key` column * Fix wrong `system_prompt_key` associated to conversation * Update unit tests
e67864e
4 months ago
by gabrielmbmb
-5%
docs: update component gallery (#987)
370e5b5
4 months ago
by davidberenstein1957
+2%
docs: update install overview in readme
33b58bf
4 months ago
by davidberenstein1957
-1%
docs: update installation overview
a2ab68d
4 months ago
by davidberenstein1957
0%
Fix missing batch when last batch arrive early (#989)
f997cfd
4 months ago
by zye1996
+1%
Fine personas socialai tutorial (#992) * Remove pdm things * Draft of socialai example * Add example/post for socialai/fine personas * Simplify title per code review
ad231ab
4 months ago
by plaguss
-3%
feat: add basic draw implementation to pipeline (#966) * feat: add basic draw implementation to pipeline * refactor: cleanup some code * feat: add functionality to draw TD or LR * refactor: remove step name from vis * refactor: default to LR generation * Add dag with mapping * feat: add edge labels * Remove images * feat: add support for leaf node to argilla and distilabel * refactor: order of functions * test: Add tests * fix: replace logger warning for `warning.warn` to avoid non-initialized logger * fix: avoid potentially getting raised errors during `get_outputs` call relying on dynamic calls * docs: Add visualizing pipelines section * feat: Add a try-except around pipeline visualization in Notebook to ensure it will never be a blocking action * feat: add a show method to the pipelines for visualizing in notebooks * docs: add more context on pipeline.show * Apply suggestions from code review Co-authored-by: Agus <agustin@argilla.io> * Update src/distilabel/steps/generators/huggingface.py * feat: remove show to simplify flow * refactor: mermaid URL at top as constant * feat: improve flow for passing by info to a potential next step * docs: update docstring --------- Co-authored-by: Agus <agustin@argilla.io>
c7deafa
4 months ago
by davidberenstein1957
+2%
Fix schema inference structured generation (#994) * fix: converting ModelMetaClass to model_json_schema * fix: allow for adding optional literal format json to instructor to make methods more inter-changable * docs: emphasize usability with any framework * fix: first check if structured_output has been defined * Update docs/sections/how_to_guides/advanced/structured_generation.md Co-authored-by: Agus <agustin@argilla.io> --------- Co-authored-by: Agus <agustin@argilla.io>
d7e61b5
3 months ago
by davidberenstein1957
0%
[DOCS] Add developer documentation section in the docs (#999) * Add new section with developer docs * Fix name of link * Add help for PR body
a178109
3 months ago
by plaguss
-2%
Fix `vllm` installation in CI (#1009)
a49242d
3 months ago
by gabrielmbmb
+2%
Fix writing `distilabel_metadata` column when `LLM` error (#1003) * fix metadata writeout when llm error * linter reformat --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
3244c05
3 months ago
by zye1996
0%
Add example of custom text generation step in quickstart (#984)
3fd680c
3 months ago
by plaguss
0%
Fix `llvmlite` install with `uv` (#1018) * Add `numba >= 0.54.0` * Use `numpy < 2.0.0` * Install vLLM first * remove llm blender install
4848dd2
3 months ago
by gabrielmbmb
-1%
tests: validate passing questions and field within format_input too (#1017) Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
d5c0484
3 months ago
by davidberenstein1957
+1%
Fix impute when `output_mapping` is not empty (#1015)
4b8903b
3 months ago
by zye1996
0%
Add Tasks to replicate `APIGen` (#925) * Add apigen task module * Add tests for apigen * Fix default name for dataset info when requesting the number of examples * checkpoint * Add tests for apigen generator * Create jinja template, split methods and add docstrings * Update string format * Simplify function setting and move it to load method * Add tests for semantic checker * Add prompt template for semantic checker * Redirect import for semantic checker * Fix docstrins for output columns * Add semantic checker task from apigen * Add notes for execution checker * Remove extra jump of line * Add first version of data sampler, step helper for apigen * Add tests for data sampler * Add integration test to check the sampler can be mixed with another generator step * Draft tests for new execution checker * Move helper functions * Draft for execution checker functionality * Add first version of execution checker and tests * Add tests for utils module of apigen * Remove unnecessary step for transformation and rename files for clarity * Fix import * Change function results name to show the original results from the execution * Remove print when the url for a reference doesn't contain https://arxiv * first working version * Fix tests including previous columns * Go back to previous name for dummy llm * Change dummy llm names on tests * Read the answers from the model parsed instead of dumped string * Add option to include the tools if available for few shot * Allow extra checks for the parameter types and tests for those * Add docs for the execution checker * Add new icon for execution * Fix return type for outputs column * Fix docstrings * Redirect imports to top level * Update docstrings to render on components gallery * Improve docstrings for fields in the data sampler * Remove unnecesary data from docstrings and remove TODO * Add missing data variable in example * Update src/distilabel/steps/tasks/apigen/execution_checker.py Co-authored-by: Gabriel Martín Blázquez 
<gmartinbdev@gmail.com> * Refactor to return formatted json string instead of dict to simplify work with arrow * Draft tutorial to replicate paper * Allow number to be a dict with values and probabilities * Update pipeline run call * Add functionality to load functions from a folder with .py files * Fix comment for arg * Add example implementation * Add dependency for vllm * Fix dependency name * Add setuptools-scm in the script with the dependencies to install it prior to vllm * Another attempt with system * Add tests to take into account casting methods * Avoid casting and update prompt to ensure argument order is respected * Inform error type on generator * Add extra checks and safeguards for failed answer generation * Ensure the error is of the expected type * Fix unstructured generation * Remove json fences and fix semantic checker * Control case of functions without arguments * Add additional checks to run the execution checker * Remove additional dependency * Try fixing CI error with dependencies * Install dependency for the system * Undo fix attempt * Try fixing llvmlite dependency issue * Remove additional dependency as it breaks other tests --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
4b056ff
3 months ago
by plaguss
-1%
Pretty print (#934) * Add integration test to showcase the prompts * Add a base print method so the Tasks can pretty print their prompts easily * Update base method to allow automatic pretty printing * Add optional argument instead of the default, update method name and return type * Fix type hint * Add example in docstrings * Add section in docs for the print method
87683f0
3 months ago
by plaguss
+1%
Add `CLAIR` task (#926) * Redirect import of CLAIR * Add jinja2 template for CLAIR * Add CLAIR task * Add tests for CLAIR task * Update example in docstrings * Add tutorial to reproduce CLAIR * Show new tutorial in the gallery and fix rendering issue in docstrings
e027f99
3 months ago
by plaguss
-26%
Add cache at `Step` level (#766) * Add signature method for Serializable objects * Update signature to only keep track of the step names and not it's internal info * Refactor hash generation * Add dummy batch manager from dag * Update batch manager cache tests to start batch manager from a DAG * Draft of integration tests for new caching * Checkpoint draft * Add cache directory location * Add use_cache argument to Step for future use * Change output names to keep track of them while debugging * Make use of use_cache at the step level * Add docstrings for internal batch manager arguments * Remove path from add_batch method * Move step caching to get_batch method in batch manager step * Read batches from cached dir * Set every step cache to False if the pipeline has the cache as False * Comment for the batch manager * Move back to caching from add_step * Checkpoint current status * Add use_cache on step * If there's previous data saved, concatenate the content of the parquet files * Only read the distiset from cache if all the steps are the same, otherwise overwrite * Add changes to make loading a new and modified step feasible * Set use cache to True by default * Move logic of registering the batches to BasePipeline._register_batch to do it before calling _manage_batch_flows * Avoid reading parquet file from cache when any of the steps has use_cach=False * Add is_convergence method to DAG and cleanup batch_manager * Add integration tests for the new caching mechanism * Update unit tests related to register_batch * Fix signature serialization case of void list * Add use_cache to argilla tests * Fix tests related to use_cache * Fix tests * Remove undefined object input * Add `_invalidate_steps_cache_if_required` method * Initial work for loading batches from `batch_manager_data` directory * Draft cache updates * Update pipeline signature * Add signature mixin from other PR * Moved pipeline cache to executions folder with different data per pipeline * Testing new 
updates to read from cache * Checkpoint with loading working while adding new steps * Point of control * Fix not all the batches where being saved * Sort batches after loaded * Fix `load_from_cache` to load batches from `steps_data` directory correctly * Update test * Add `step_has_finished` method * Update invalidate cache function * Update integration caching tests * Refactor to extract logic to methods * Refactor to remove `cached_data_dir` * Update stages message * Refactor `invalidate_cache_for` method * Fix `_BatchManager` unit tests * Update to not serialize `exclude_from_signature` attribute * Fix pipeline unit tests * Remove write buffer data if `use_cache=False` * Fix offline batch generation attributes were being not ignored by signature * Fix print test * Fix routing batch function --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
ebab004
3 months ago
by plaguss
+23%
Fix `IndexError` when overriding inputs and `group_generations=False` (#1022) * Fix processing num_generations when applying input mappings in steps process * Add unit test * Update comment --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
4cbcb90
3 months ago
by plaguss
-33%
Update `Pipeline cache` docs (#1023) * Update link * Update cache section * Add step to fail if warnings * Fix dependency name
d99011c
3 months ago
by gabrielmbmb
+2%
Fix cross-reference
6ef15f4
3 months ago
by gabrielmbmb
+33%
Bump version to `1.5.0`
303722c
3 months ago
by gabrielmbmb
-2%
Add common typing module (#1029) * Add common typing module * Update src/distilabel/typing.py Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> --------- Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
106e402
3 months ago
by plaguss
+1%
Update `docs` workflows to use `uv` (#1032) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
dc06161
3 months ago
by gabrielmbmb
+1%
Merge branch 'main' into develop
2cedc7d
3 months ago
by gabrielmbmb
+2%
fix: simplify prompt template `ArgillaLabeller` (#1033)
9840129
3 months ago
by davidberenstein1957
-2%
Add `dataset_batch_size` argument (#1039)
1f75593
3 months ago
by gabrielmbmb
0%
Move all LLMs to distilabel.models (#1045)
7c8976b
2 months ago
by plaguss
+1%
Fix a tiny typo in `_Step` docstring (#1051) Co-authored-by: plaguss <agustin@argilla.io>
f949640
2 months ago
by sadra-barikbin
0%
docs: improve docs for `MinHashDedup` `Step` (#1050) Co-authored-by: plaguss <agustin@argilla.io>
f1397e9
2 months ago
by anakin87
-2%
Fix new response_format variable in openai api (#1053)
53e46c1
2 months ago
by plaguss
+2%
[pre-commit.ci] pre-commit autoupdate (#1043) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
e830e25
2 months ago
by pre-commit-ci[bot]
+1%
Update `LLM.generate` output to include `statistics` (#1034) Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
2469407
2 months ago
by plaguss
-2%
Add example of structured output. (#1061) Co-authored-by: burtenshaw <ben@argilla.io>
cb4ba1b
2 months ago
by plaguss
0%
hotfix: import errors in llms
a8d02c2
2 months ago
by burtenshaw
-2%
Fix `StepOutput` type (#1072)
8dd6405
1 month ago
by plaguss
-1%
docs: update issue templates (#1074)
fa13ae1
1 month ago
by sdiazlor
+8%
Update `unload` method from `vLLM` to properly free resources (#1077)
f8e41cd
1 month ago
by gabrielmbmb
-7%
Add tasks to replicate Math-shepherd (#1052) Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
6bb61d1
1 month ago
by plaguss
+4%
Add `load_groups` argument to `run` (#1075) Co-authored-by: Agus <agustin@argilla.io>
55d9e5d
1 month ago
by gabrielmbmb
-88%
Add TextGenerationWithImage task (#1066) Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
63c75c5
26 days ago
by plaguss
+710%
Create columns with `LLM` returned extra keys (#1078)
a8588fd
22 days ago
by gabrielmbmb
-625%
Enable `RUF022` to automatically sort `__all__`
cf28976
22 days ago
by gabrielmbmb
-73%
Fix `vLLM` unload logic when model is `None` (#1080)
c2ae3f1
17 days ago
by gabrielmbmb
+365%
Fix `merge_distilabel_metadata` function when handling outputs from `Task` with `group_generations==True` (#1082)
925d259
16 days ago
by gabrielmbmb
-291%
Merge branch 'main' into develop
f1f7d77
16 days ago
by gabrielmbmb