# Getting Started

## Installation

Install the latest version via pip...

```bash
pip install be-great
```

... or download the source code from GitHub:

```bash
git clone https://github.com/tabularis-ai/be_great.git
```
## Requirements
GReaT requires Python 3.9 (or higher) and the following packages:
- datasets >= 2.5.2
- numpy >= 1.23.1
- pandas >= 1.4.4
- scikit_learn >= 1.1.1
- scipy >= 1.9.0
- torch >= 1.10.2
- tqdm >= 4.64.1
- transformers >= 4.22.1
- accelerate >= 0.20.1
- fsspec >= 2024.5.0
Optional:
- peft >= 0.14.0 (for LoRA fine-tuning)
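After installing, a quick environment check can save debugging time later. The snippet below (plain Python, not part of the GReaT API) confirms that the core dependencies import cleanly and whether a CUDA GPU is available for `fp16` training:

```python
# Quick environment check: core dependency versions and GPU availability
import torch
import transformers

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}")
```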
## Quickstart

The example below shows how GReaT is used to generate synthetic tabular data for the California Housing dataset.
```python
from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Fine-tune a distilgpt2 model for 50 epochs with mixed precision
model = GReaT(llm='distilgpt2', batch_size=32, epochs=50, fp16=True)
model.fit(data)

# Generate 100 synthetic rows
synthetic_data = model.sample(n_samples=100)
```
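A fitted model can be saved and reloaded later, so training and sampling do not have to happen in the same session. The snippet below uses the `save` / `load_from_dir` methods from the upstream be_great API; the directory name is arbitrary:

```python
# Persist the fitted model, then reload it for further sampling
model.save("great_california")  # writes the model and its config to a directory
model = GReaT.load_from_dir("great_california")
more_synthetic_data = model.sample(n_samples=1000)
```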
## Random Preconditioning

By default, GReaT conditions each training example on a single, fixed column, which can lead to overfitting on that column. Enable random preconditioning to select a different conditioning column each epoch:
```python
model.fit(data, random_conditional_col=True)
```
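Alternatively, you can pin the conditioning to one domain-relevant column via the `fit` method's `conditional_col` argument (part of the base be_great API). The sketch below assumes the California Housing frame from the Quickstart, whose target column is `MedHouseVal`:

```python
# Condition training on a fixed, meaningful column instead
model.fit(data, conditional_col="MedHouseVal")
```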
## Guided Sampling & Float Precision

For small datasets or datasets with many features, use guided sampling and limited float precision:
```python
# Encode float columns with 3 decimal places, shortening the text encoding
model = GReaT(
    llm='distilgpt2',
    float_precision=3,
    batch_size=8,
    epochs=100,
    fp16=True,
)
model.fit(data)

# Sample with guided sampling, a randomized feature order, and a lower
# temperature for more conservative generations
synthetic_data = model.sample(
    n_samples=100,
    guided_sampling=True,
    random_feature_order=True,
    temperature=0.7,
)
```
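As a quick sanity check that limited float precision and guided sampling preserved the marginal distributions, you can compare summary statistics of the real and synthetic frames with plain pandas (no GReaT-specific API involved):

```python
import pandas as pd

# Compare per-column means and standard deviations of real vs. synthetic data
summary = pd.concat(
    {
        "real": data.describe().loc[["mean", "std"]],
        "synthetic": synthetic_data.describe().loc[["mean", "std"]],
    },
    axis=0,
)
print(summary)
```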
## LoRA Fine-Tuning

GReaT supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, which reduces memory usage and training time. Install the optional `peft` dependency first:

```bash
pip install peft
```
```python
# LoRA trains small low-rank adapter weights instead of the full model
model = GReaT(
    llm='distilgpt2',
    batch_size=32,
    epochs=5,
    efficient_finetuning="lora",
    lora_config={"r": 8, "lora_alpha": 16, "lora_dropout": 0.1},
    fp16=True,
)
model.fit(data)
synthetic_data = model.sample(n_samples=100)
```
Target modules are auto-detected for common architectures (GPT-2, LLaMA, Falcon, etc.), or can be specified explicitly via `lora_config["target_modules"]`.
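For example, to target GPT-2's fused attention projection (named `c_attn` in the Hugging Face implementation) explicitly:

```python
# Same LoRA setup, but with the target module spelled out for GPT-2
model = GReaT(
    llm='distilgpt2',
    batch_size=32,
    epochs=5,
    efficient_finetuning="lora",
    lora_config={
        "r": 8,
        "lora_alpha": 16,
        "lora_dropout": 0.1,
        "target_modules": ["c_attn"],  # GPT-2's fused QKV projection
    },
    fp16=True,
)
```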
## Evaluating Synthetic Data

After generating synthetic data, use the built-in metrics to measure quality and privacy:
```python
from be_great.metrics import (
    ColumnShapes,
    DiscriminatorMetric,
    MLEfficiency,
    DistanceToClosestRecord,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Compare column distributions (KS test / TVD)
ColumnShapes().compute(data, synthetic_data)

# Train a classifier to distinguish real vs. synthetic (0.5 = best)
DiscriminatorMetric().compute(data, synthetic_data)

# Train on synthetic data, test on real data; label_col names your label
# column (for a regression target, pass a regressor and a regression metric)
MLEfficiency(
    model=RandomForestClassifier,
    metric=accuracy_score,
    model_params={"n_estimators": 100},
).compute(data, synthetic_data, label_col="target")

# Check privacy: distance to closest real record
DistanceToClosestRecord().compute(data, synthetic_data)
```
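To run all checks in one pass, the scores can be gathered into a plain dictionary for logging (a convenience sketch, not part of the metrics API):

```python
# Collect the distribution, discriminator, and privacy scores in one place
report = {
    "column_shapes": ColumnShapes().compute(data, synthetic_data),
    "discriminator": DiscriminatorMetric().compute(data, synthetic_data),
    "distance_to_closest_record": DistanceToClosestRecord().compute(data, synthetic_data),
}
print(report)
```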
See the Examples section for more details.