Machine Learning¶
CivisML uses the Civis Platform to train machine learning models and parallelize their predictions over large datasets. It contains best-practice models for general-purpose classification and regression modeling as well as model quality evaluations and visualizations. All CivisML models use scikit-learn for interoperability with other platforms and to allow you to leverage resources in the open-source software community when creating machine learning models.
Define Your Model¶
Start the modeling process by defining your model. Do this by creating an instance of the ModelPipeline class. Each ModelPipeline corresponds to a scikit-learn Pipeline which will run in Civis Platform.
A Pipeline allows you to combine multiple modeling steps (such as missing value imputation and feature selection) into a single model. The Pipeline is treated as a unit; for example, cross-validation happens over all steps together.
You can define your model in two ways: either by selecting a pre-defined algorithm or by providing your own scikit-learn Pipeline or BaseEstimator object. Note that whichever option you choose, CivisML will pre-process your data to one-hot-encode categorical features (the non-numerical columns) into binary indicator columns before sending the features to the Pipeline.
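As a rough illustration of what one-hot encoding does to non-numerical columns, here is a minimal pure-Python sketch. It is not CivisML's implementation; the helper name `one_hot_encode`, the row layout, and the `column_value` naming of the indicator columns are assumptions made for this example only.

```python
def one_hot_encode(rows):
    """Expand non-numeric columns of a list of dict rows into 0/1 indicator columns."""
    # Collect the distinct values seen in every non-numeric column.
    categories = {}
    for row in rows:
        for column, value in row.items():
            if not isinstance(value, (int, float)):
                categories.setdefault(column, set()).add(value)

    encoded = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in categories:
                # One binary indicator column per category value.
                for category in sorted(categories[column]):
                    new_row[f"{column}_{category}"] = int(value == category)
            else:
                # Numeric columns pass through unchanged.
                new_row[column] = value
        encoded.append(new_row)
    return encoded


# Hypothetical example data: "state" is categorical, "age" is numeric.
rows = [
    {"age": 34, "state": "IL"},
    {"age": 51, "state": "NY"},
    {"age": 27, "state": "IL"},
]
encoded_rows = one_hot_encode(rows)
```

Each categorical value becomes its own binary column, so the Pipeline only ever sees numeric features.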
Pre-Defined Models¶
You can use the following pre-defined models with CivisML. All models start by imputing missing values with the mean of non-null values in a column. The “sparse_*” models include a LASSO regression step (using the glmnet package) to do feature selection before passing data to the final model.
In some models, CivisML uses default parameters different from those in scikit-learn, as indicated in the “Altered Defaults” column. All models also have random_state=42.
Name | Model Type | Algorithm | Altered Defaults |
---|---|---|---|
sparse_logistic | classification | LogisticRegression | C=499999950, tol=1e-08 |
gradient_boosting_classifier | classification | GradientBoostingClassifier | n_estimators=500, max_depth=2 |
random_forest_classifier | classification | RandomForestClassifier | n_estimators=500 |
extra_trees_classifier | classification | ExtraTreesClassifier | n_estimators=500 |
sparse_linear_regressor | regression | LinearRegression | |
sparse_ridge_regressor | regression | Ridge | |
gradient_boosting_regressor | regression | GradientBoostingRegressor | n_estimators=500, max_depth=2 |
random_forest_regressor | regression | RandomForestRegressor | n_estimators=500 |
extra_trees_regressor | regression | ExtraTreesRegressor | n_estimators=500 |
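The mean-imputation step that every pre-defined model starts with can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not the code CivisML runs; the helper name `impute_column_means` and the use of `None` to mark missing values are assumptions.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the non-null values in that column.

    Assumes every column has at least one non-null value.
    """
    n_columns = len(rows[0])
    means = []
    for j in range(n_columns):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    # Fill each missing cell with its column mean; keep observed values as-is.
    return [[means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]


# Hypothetical feature matrix with one missing value per column.
features = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
]
imputed = impute_column_means(features)
```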
Custom Models¶
You can create your own Pipeline instead of using one of the pre-defined ones. Create the object and pass it as the model parameter of the ModelPipeline. Your model must be built from libraries which CivisML recognizes. You can use code from
- scikit-learn v0.18.1
- glmnet v2.0.0
- xgboost v0.6a2
- muffnn v1.1.1
When you’re assembling your own model, remember that either you must add a missing value imputation step or your data must not have any missing values. If you’re making a classification model, the model must have a predict_proba method. If the class you’re using doesn’t have a predict_proba method, you can add one by wrapping it in a CalibratedClassifierCV.
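A custom classification Pipeline along these lines might look like the sketch below. It runs locally with scikit-learn only and does not call CivisML. It assumes a recent scikit-learn in which the imputer is `sklearn.impute.SimpleImputer` (in v0.18.1 the equivalent class was `sklearn.preprocessing.Imputer`), and the tiny data set is made up purely to show that the wrapped estimator gains a predict_proba method.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# LinearSVC has no predict_proba, so wrap it in CalibratedClassifierCV.
# The imputation step handles missing values before the classifier sees them.
custom_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("classify", CalibratedClassifierCV(LinearSVC(), cv=2)),
])

# Made-up illustrative data with two missing values (np.nan).
X = np.array([[1.0, np.nan], [2.0, 1.0], [3.0, 0.5], [1.5, 0.8],
              [8.0, 5.0], [9.0, np.nan], [10.0, 6.0], [8.5, 5.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

custom_pipeline.fit(X, y)
probabilities = custom_pipeline.predict_proba(X)  # one row per sample, one column per class
```

An object like `custom_pipeline` (unfitted) is what you would pass as the model parameter of the ModelPipeline.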
Custom Dependencies¶
Installing packages from PyPI is straightforward. You can specify a dependencies argument to civis.ml.ModelPipeline which will install the dependencies in your runtime environment. VCS support is also enabled (see [docs](https://pip.pypa.io/en/stable/reference/pip_install/#vcs-support)). Installing a remote git repository from, say, GitHub only requires passing the HTTPS URL in the form of, for example, git+https://github.com/scikit-learn/scikit-learn.
CivisML will run pip install [your package here]. We strongly encourage you to pin package versions for consistency. Example code looks like:
from civis.ml import ModelPipeline
from pyearth import Earth
deps = ['git+https://github.com/scikit-learn-contrib/py-earth.git@da856e11b2a5d16aba07f51c3c15cef5e40550c7']
est = Earth()
model = ModelPipeline(est, dependent_variable='age', dependencies=deps)
train = model.train(table_name='donors.from_march', database_name='client')
Additionally, you can store a remote git host’s API token in the Civis Platform as a credential to use for installing private git repositories. For example, you can go to GitHub at the https://github.com/settings/tokens URL, copy your token into the password field of a credential, and pass the credential name to the git_token_name argument in civis.ml.ModelPipeline. This also works with other hosting services. A simple example of how to do this with the API looks as follows:
import civis
password = 'abc123' # token copied from https://github.com/settings/tokens
username = 'user123' # Github username
git_token_name = 'Github credential'
client = civis.APIClient()
credential = client.credentials.post(password=password,
                                     username=username,
                                     name=git_token_name,
                                     type="Custom")
pipeline = civis.ml.ModelPipeline(..., git_token_name=git_token_name)
Note, installing private dependencies with submodules is not supported.
Asynchronous Execution¶
All calls to a ModelPipeline object are non-blocking, i.e. they immediately provide a result without waiting for the job in the Civis Platform to complete. Calls to civis.ml.ModelPipeline.train() and civis.ml.ModelPipeline.predict() return a ModelFuture object, which is a subclass of Future from the Python standard library. This behavior lets you train multiple models at once, or generate predictions from models, and keep doing other work while your jobs run.
The ModelFuture can find and retrieve outputs from your CivisML jobs, such as trained Pipeline objects or out-of-sample predictions. The ModelFuture only downloads outputs when you request them.
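The non-blocking pattern can be tried locally with the standard library alone. In the sketch below a `ThreadPoolExecutor` stands in for Civis Platform and the hypothetical `slow_job` function stands in for a training job; neither is part of CivisML, but the returned objects are the same standard-library Future type that ModelFuture subclasses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_job(name, seconds):
    """Stand-in for a long-running remote job (hypothetical)."""
    time.sleep(seconds)
    return f"{name} finished"


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() hands back a Future right away instead of waiting for the job.
    first = pool.submit(slow_job, "model A", 0.2)
    second = pool.submit(slow_job, "model B", 0.2)

    # Other work can happen here while both jobs run in the background.
    other_work = sum(range(1000))

    # result() blocks until the corresponding job has completed.
    results = [first.result(), second.result()]
```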
Model Persistence¶
Civis Platform permanently stores all models, indexed by the job ID and the run ID (also called a “build”) of the training job. If you wish to use an existing model, call civis.ml.ModelPipeline.from_existing() with the job ID of the training job. You can find the job ID with the train_job_id attribute of a ModelFuture, or by looking at the URL of your model on the Civis Platform models page. If the training job has multiple runs, you may also provide a run ID to select a run other than the most recent. You can list all model runs of a training job by calling civis.APIClient().jobs.get(train_job_id)['runs']. You may also store the ModelPipeline itself with the pickle module.
Examples¶
Future objects have the method add_done_callback(). The callback is called as soon as the run completes. It takes a single argument, the Future for the completed job. You can use this method to chain jobs together:
from concurrent import futures
from civis.ml import ModelPipeline
import pandas as pd
df = pd.read_csv('data.csv')
training, predictions = [], []
model = ModelPipeline('sparse_logistic', dependent_variable='type')
training.append(model.train(df))
training[-1].add_done_callback(lambda fut: predictions.append(model.predict(df)))
futures.wait(training) # Blocks until all training jobs complete
futures.wait(predictions) # Blocks until all prediction jobs complete
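The same chaining idea can be exercised without Civis Platform. The sketch below is a standard-library stand-in, not CivisML code: `train` and `predict` are hypothetical toy functions and a `ThreadPoolExecutor` plays the role of the platform. Because the callback appends to the predictions list asynchronously, that list can still be empty if it is read right after the training wait returns, so this sketch waits on a `threading.Event` set by the callback before reading it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def train(data):
    """Toy 'model': the mean of the training data (hypothetical stand-in)."""
    return sum(data) / len(data)


def predict(model, data):
    """Toy 'prediction': each value's offset from the trained mean."""
    return [x - model for x in data]


data = [1.0, 2.0, 3.0]
predictions = []
prediction_submitted = threading.Event()

with ThreadPoolExecutor(max_workers=2) as pool:
    def start_prediction(train_future):
        # Called with the completed training Future as its only argument.
        predictions.append(pool.submit(predict, train_future.result(), data))
        prediction_submitted.set()

    training = pool.submit(train, data)
    training.add_done_callback(start_prediction)

    # Wait until the callback has queued the prediction job, then for its result.
    prediction_submitted.wait(timeout=5)
    chained_result = predictions[0].result()
```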
You can create and train multiple models at once to find the best approach for solving a problem. For example:
from civis.ml import ModelPipeline
algorithms = ['gradient_boosting_classifier', 'sparse_logistic', 'random_forest_classifier']
pkey = 'person_id'
depvar = 'likes_cats'
models = [ModelPipeline(alg, primary_key=pkey, dependent_variable=depvar) for alg in algorithms]
train = [model.train(table_name='schema.name', database_name='My DB') for model in models]
aucs = [tr.metrics['roc_auc'] for tr in train] # Execution blocks here until each training job finishes
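Once the metrics are available, picking the best approach is a one-line comparison. The sketch below uses hypothetical placeholder AUC values (they are not results from any real model) in place of the values a training run would return.

```python
algorithms = ['gradient_boosting_classifier', 'sparse_logistic', 'random_forest_classifier']

# Hypothetical placeholder values standing in for tr.metrics['roc_auc'];
# these are NOT real results.
aucs = [0.81, 0.79, 0.84]

# Pair each score with its algorithm name and keep the highest-scoring pair.
best_auc, best_algorithm = max(zip(aucs, algorithms))
```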
Optional dependencies¶
You do not need any external libraries installed to use CivisML, but the following pip-installable dependencies enhance the capabilities of the ModelPipeline:
- pandas
- scikit-learn
- glmnet
Install pandas if you wish to download tables of predictions. You can also model on DataFrame objects in your interpreter.
If you wish to use custom models or download trained models, you’ll need scikit-learn installed.
The “sparse_logistic”, “sparse_linear_regressor”, and “sparse_ridge_regressor” models all use the public Civis Analytics glmnet library. Install it if you wish to download a model created from one of these pre-defined models.
Object reference¶
class civis.ml.ModelPipeline(model, dependent_variable, primary_key=None, parameters=None, cross_validation_parameters=None, model_name=None, calibration=None, excluded_columns=None, client=None, cpu_requested=None, memory_requested=None, disk_requested=None, notifications=None, dependencies=None, git_token_name=None, verbose=False)
Interface for scikit-learn modeling in the Civis Platform
Each ModelPipeline corresponds to a scikit-learn Pipeline which will run in Civis Platform.
Parameters:
model : string or Estimator
Either the name of a pre-defined model (e.g. “sparse_logistic” or “gradient_boosting_classifier”) or else a pre-existing Estimator object.
dependent_variable : string or List[str]
The dependent variable of the training dataset. For a multi-target problem, this should be a list of column names of dependent variables.
primary_key : string, optional
The unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores.
parameters : dict, optional
Specify parameters for the final stage estimator in a predefined model, e.g. {'C': 2} for a “sparse_logistic” model.
cross_validation_parameters : dict, optional
Cross validation parameter grid for learner parameters, e.g. {'n_estimators': [100, 200, 500], 'learning_rate': [0.01, 0.1], 'max_depth': [2, 3]}.
model_name : string, optional
The prefix of the Platform modeling jobs. It will have “ Train” or “ Predict” added to become the Script title.
calibration : {None, “sigmoid”, “isotonic”}
If not None, calibrate output probabilities with the selected method. Valid only with classification models.
excluded_columns : array, optional
A list of columns which will be considered ineligible to be independent variables.
client : APIClient, optional
If not provided, an APIClient object will be created from the CIVIS_API_KEY.
cpu_requested : int, optional
Number of CPU shares requested in the Civis Platform for training jobs. 1024 shares = 1 CPU.
memory_requested : int, optional
Memory requested from Civis Platform for training jobs, in MiB
disk_requested : float, optional
Disk space requested on Civis Platform for training jobs, in GB
notifications : dict
See post_custom() for further documentation about email and URL notification.
dependencies : array, optional
List of packages to install from PyPI or a git repository (e.g., GitHub or Bitbucket). If a private repo is specified, please include a git_token_name argument as well (see below).
git_token_name : str, optional
Name of remote git API token stored in platform as the password field in a custom platform credential. Used only when installing private git repositories.
verbose : bool, optional
If True, supply debug outputs in Platform logs and make prediction child jobs visible.
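To get a feel for the size of a cross_validation_parameters grid, the sketch below enumerates the combinations in the example grid from the parameter description. It assumes the grid is searched exhaustively, as in an ordinary grid search; how CivisML schedules the individual fits is not shown here.

```python
from itertools import product

grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3],
}

# Every combination of one value per parameter: 3 * 2 * 2 candidates.
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]
n_candidates = len(combinations)
```

Each candidate is fit once per cross-validation fold, so large grids multiply training time quickly.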
See also
Examples
>>> from civis.ml import ModelPipeline
>>> model = ModelPipeline('gradient_boosting_classifier', 'depvar',
...                       primary_key='voterbase_id')
>>> train = model.train(table_name='schema.survey_data',
...                     fit_params={'sample_weight': 'survey_weight'},
...                     database_name='My Redshift Cluster',
...                     oos_scores='scratch.survey_depvar_oos_scores')
>>> train
<ModelFuture at 0x11be7ae10 state=queued>
>>> train.running()
True
>>> train.done()
False
>>> df = train.table  # Read OOS scores from its Civis File. Blocking.
>>> meta = train.metadata  # Metadata from training run
>>> train.metrics['roc_auc']
0.88425
>>> pred = model.predict(table_name='schema.demographics_table',
...                      database_name='My Redshift Cluster',
...                      output_table='schema.predicted_survey_response',
...                      if_exists='drop',
...                      n_jobs=50)
>>> df_pred = pred.table  # Blocks until finished
# Modify the parameters of the base estimator in a default model:
>>> model = ModelPipeline('sparse_logistic', 'depvar',
...                       primary_key='voterbase_id',
...                       parameters={'C': 2})
# Grid search over hyperparameters in the base estimator:
>>> model = ModelPipeline('sparse_logistic', 'depvar',
...                       primary_key='voterbase_id',
...                       cross_validation_parameters={'C': [0.1, 1, 10]})
Attributes
estimator (Pipeline) The trained scikit-learn Pipeline
train_result_ (ModelFuture) ModelFuture encapsulating this model’s training run
state (str) Status of the training job (non-blocking)
Methods
train() Train the model on data in Civis Platform; outputs ModelFuture
predict() Make predictions on new data; outputs ModelFuture
from_existing() Class method; use to create a ModelPipeline from an existing model training run
classmethod from_existing(train_job_id, train_run_id='latest', client=None)
Create a ModelPipeline object from existing model IDs
Parameters:
train_job_id : int
The ID of the CivisML job in the Civis Platform
train_run_id : int or string, optional
Location of the model run, either
- an explicit run ID,
- “latest” : The most recent run
- “active” : The run designated by the training job’s “active build” parameter
client : APIClient, optional
If not provided, an APIClient object will be created from the CIVIS_API_KEY.
Returns: A ModelPipeline which refers to a previously-trained model
Examples
>>> from civis.ml import ModelPipeline
>>> model = ModelPipeline.from_existing(job_id)
>>> model.train_result_.metrics['roc_auc']
0.843
predict(df=None, csv_path=None, table_name=None, database_name=None, manifest=None, file_id=None, sql_where=None, sql_limit=None, primary_key=Sentinel(), output_table=None, output_db=None, if_exists='fail', n_jobs=None, polling_interval=None)
Make predictions on a trained model
Provide input through one of a DataFrame (df), a local CSV (csv_path), a Civis Table (table_name and database_name), a Civis File containing a CSV (file_id), or a Civis File containing a manifest file (manifest).
A “manifest file” is JSON which specifies the location of many shards of the data to be used for prediction. A manifest file is the output of a Civis export job with force_multifile=True set, e.g. from civis.io.civis_to_multifile_csv(). Large Civis Tables (provided using table_name) will automatically be exported to manifest files.
Prediction outputs will always be stored as gzipped CSVs in one or more Civis Files. You can find a list of File ID numbers for output files at the “output_file_ids” key in the metadata returned by the prediction job. Provide an output_table (and optionally an output_db, if it’s different from database_name) to copy these predictions into a Civis Table.
Parameters:
df : pd.DataFrame, optional
csv_path : str, optional
The location of a CSV of data on the local disk. It will be uploaded to a Civis file.
table_name : str, optional
The qualified name of the table containing your data
database_name : str, optional
Name of the database holding the data, e.g., ‘My Redshift Cluster’.
manifest : int, optional
ID for a manifest file stored as a Civis file. (Note: if the manifest is not a Civis Platform-specific manifest, like the one returned from civis.io.civis_to_multifile_csv(), this must be used in conjunction with table_name and database_name due to the need for column discovery via Redshift.)
file_id : int, optional
If the data are a CSV stored in a Civis file, provide the integer file ID.
sql_where : str, optional
A SQL WHERE clause used to scope the rows to be predicted
sql_limit : int, optional
SQL LIMIT clause to restrict the size of the prediction set
primary_key : str, optional
Primary key of the prediction table. Defaults to the primary key of the training data. Use None to indicate that the prediction data don’t have a primary key column.
output_table : str, optional
The table in which to put the predictions.
output_db : str, optional
Database of the output table. Defaults to the database of the input table.
if_exists : {‘fail’, ‘append’, ‘drop’, ‘truncate’}
Action to take if the prediction table already exists.
n_jobs : int, optional
Number of concurrent Platform jobs to use for multi-file / large table prediction.
polling_interval : float, optional
Check for job completion every this number of seconds. Do not set if using the notifications endpoint.
Returns: ModelFuture
train(df=None, csv_path=None, table_name=None, database_name=None, file_id=None, sql_where=None, sql_limit=None, oos_scores=None, oos_scores_db=None, if_exists='fail', fit_params=None, polling_interval=None)
Start a Civis Platform job to train your model
Provide input through one of a DataFrame (df), a local CSV (csv_path), a Civis Table (table_name and database_name), or a Civis File containing a CSV (file_id).
Model outputs will always contain out-of-sample scores (accessible through ModelFuture.table on this function’s output), and you may choose to store these out-of-sample scores in a Civis Table with the oos_scores, oos_scores_db, and if_exists parameters.
Parameters:
df : pd.DataFrame, optional
csv_path : str, optional
The location of a CSV of data on the local disk. It will be uploaded to a Civis file.
table_name : str, optional
The qualified name of the table containing the training set from which to build the model.
database_name : str, optional
Name of the database holding the training set table used to build the model. E.g., ‘My Cluster Name’.
file_id : int, optional
If the training data are stored in a Civis file, provide the integer file ID.
sql_where : str, optional
A SQL WHERE clause used to scope the rows of the training set (used for table input only)
sql_limit : int, optional
SQL LIMIT clause for querying the training set (used for table input only)
oos_scores : str, optional
If provided, store out-of-sample predictions on training set data to this Redshift “schema.tablename”.
oos_scores_db : str, optional
If not provided, store OOS predictions in the same database which holds the training data.
if_exists : {‘fail’, ‘append’, ‘drop’, ‘truncate’}
Action to take if the out-of-sample prediction table already exists.
fit_params : Dict[str, str]
Mapping from parameter names in the model’s fit method to the column names which hold the data, e.g. {'sample_weight': 'survey_weight_column'}.
polling_interval : float, optional
Check for job completion every this number of seconds. Do not set if using the notifications endpoint.
Returns: ModelFuture
class civis.ml.ModelFuture(job_id, run_id, train_job_id=None, train_run_id=None, polling_interval=None, client=None, poll_on_creation=True)
Encapsulates asynchronous execution of a CivisML job
This object knows where to find modeling outputs from CivisML jobs. All data attributes are lazily retrieved and block on job completion. This object can be pickled.
Parameters:
job_id : int
ID of the modeling job
run_id : int
ID of the modeling run
train_job_id : int, optional
If not provided, this object is assumed to encapsulate a training job, and train_job_id will equal job_id.
train_run_id : int, optional
If not provided, this object is assumed to encapsulate a training run, and train_run_id will equal run_id.
polling_interval : int or float, optional
The number of seconds between API requests to check whether a result is ready. The default intelligently switches between a short interval if pubnub is not available and a long interval for pubnub backup if that library is installed.
client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
poll_on_creation : bool, optional
If True (the default), it will poll upon calling result() the first time. If False, it will wait the number of seconds specified in polling_interval from object creation before polling.
See also
civis.futures.CivisFuture, civis.futures.ContainerFuture, concurrent.futures.Future
Attributes
metadata (dict, blocking) The metadata associated with this modeling job
metrics (dict, blocking) Validation metrics from this job’s training
validation_metadata (dict, blocking) Metadata from this modeling job’s validation run
train_metadata (dict, blocking) Metadata from this modeling job’s training run (will be identical to metadata if this is a training run)
estimator (sklearn.pipeline.Pipeline, blocking) The fitted scikit-learn Pipeline resulting from this model run
table (pandas.DataFrame, blocking) The table output from this modeling job: out-of-sample predictions on the training set for a training job, or a table of predictions for a prediction job. If the prediction job was split into multiple files (this happens automatically for large tables), this attribute will provide only predictions for the first file.
state (str) The current state of the Civis Platform run
job_id (int)
run_id (int)
train_job_id (int) Container ID for the training job; identical to job_id if this is a training job.
train_run_id (int) As train_job_id but for runs
is_training (bool) True if this ModelFuture corresponds to a train-validate job.
Methods
cancel() Cancels the corresponding Platform job before completion
succeeded() (Non-blocking) Is the job a success?
failed() (Non-blocking) Did the job fail?
cancelled() (Non-blocking) Was the job cancelled?
running() (Non-blocking) Is the job still running?
done() (Non-blocking) Is the job finished?
result() (Blocking) Return the final status of the Civis Platform job.
add_done_callback(fn)
Attaches a callable that will be called when the future finishes.
Args:
fn: A callable that will be called with this future as its only argument when the future completes or is cancelled. The callable will always be called by a thread in the same process in which it was added. If the future has already completed or been cancelled then the callable will be called immediately. These callables are called in the order that they were added.
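The ordering and "called immediately" behaviour described for add_done_callback can be checked with a plain standard-library Future. This sketch does not use ModelFuture, and it completes the future by hand with set_result(), which is otherwise meant only for executors and tests.

```python
from concurrent.futures import Future

calls = []
future = Future()

# Two callbacks registered while the future is still pending.
future.add_done_callback(lambda f: calls.append(("first", f.result())))
future.add_done_callback(lambda f: calls.append(("second", f.result())))

# Completing the future runs the callbacks in the order they were added.
future.set_result("done")

# A callback added after completion is invoked immediately.
future.add_done_callback(lambda f: calls.append(("late", f.result())))
```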
cancel()
Submit a request to cancel the container/script/run.
Returns: bool
Whether or not the job is in a cancelled state.
cancelled()
Return True if the future was cancelled.
done()
Return True if the future was cancelled or finished executing.
exception(timeout=None)
Return the exception raised by the call that the future represents.
Args:
timeout: The number of seconds to wait for the exception if the future isn’t done. If None, then there is no limit on the wait time.
Returns:
The exception raised by the call that the future represents or None if the call completed without raising.
Raises:
CancelledError: If the future was cancelled.
TimeoutError: If the future didn’t finish executing before the given timeout.
failed()
Return True if the Civis job failed.
result(timeout=None)
Return the result of the call that the future represents.
Args:
timeout: The number of seconds to wait for the result if the future isn’t done. If None, then there is no limit on the wait time.
Returns:
The result of the call that the future represents.
Raises:
CancelledError: If the future was cancelled.
TimeoutError: If the future didn’t finish executing before the given timeout.
Exception: If the call raised then that exception will be raised.
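The timeout and exception behaviour of result() and exception() can be seen with a bare standard-library Future. This is a sketch of the inherited Future semantics only, not of a real CivisML job; the futures are completed by hand for the demonstration.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeoutError

# A future that never completes: result() gives up after the timeout.
pending = Future()
try:
    pending.result(timeout=0.01)
    timed_out = False
except FutureTimeoutError:
    timed_out = True

# A future that ended with an error (set manually for this demonstration).
failed = Future()
failed.set_exception(ValueError("bad input"))

stored_error = failed.exception()  # returns the exception without raising it
try:
    failed.result()                # re-raises the stored exception
    raised_message = None
except ValueError as exc:
    raised_message = str(exc)
```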
running()
Return True if the future is currently executing.
set_exception(exception)
Sets the result of the future as being the given exception.
Should only be used by Executor implementations and unit tests.
set_result(result)
Sets the return value of work associated with the future.
Should only be used by Executor implementations and unit tests.
set_running_or_notify_cancel()
Mark the future as running or process any cancel notifications.
Should only be used by Executor implementations and unit tests.
If the future has been cancelled (cancel() was called and returned True) then any threads waiting on the future completing (through calls to as_completed() or wait()) are notified and False is returned.
If the future was not cancelled then it is put in the running state (future calls to running() will return True) and True is returned.
This method should be called by Executor implementations before executing the work associated with this future. If this method returns False then the work should not be executed.
Returns:
False if the Future was cancelled, True otherwise.
Raises:
RuntimeError: if this method was already called or if set_result() or set_exception() was called.
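The two return values of set_running_or_notify_cancel() can be observed on plain standard-library Future objects, as an executor or unit test would use them. Note that this sketch uses the base Future, whose cancel() simply cancels a pending future locally; ModelFuture.cancel() instead submits a cancellation request to Civis Platform.

```python
from concurrent.futures import Future

# Case 1: the future was cancelled before any work started.
cancelled_future = Future()
was_cancelled = cancelled_future.cancel()
should_run_cancelled = cancelled_future.set_running_or_notify_cancel()

# Case 2: the future was not cancelled, so it moves to the running state.
fresh_future = Future()
should_run_fresh = fresh_future.set_running_or_notify_cancel()
is_running = fresh_future.running()
```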
succeeded()
Return True if the job completed in Civis with no error.