Getting started

Create a release

With DataQA you can share a dataset, together with a model's output predictions, with the wider team for quality assessment.


Basics: what is a release?

A DataQA project can have one or more releases. A release is associated with a single dataset and includes all the analysis related to that dataset. Each release has its own URL of the form https://app.dataqa.ai/release/RELEASE_ID, which you can share with your team members so they can explore the data.

A release can have one or more chart groups. A chart group is a piece of analysis coming from a single data query, for example "show the distribution of predictions grouped by country". The query produces a set of charts, and together these charts make up the chart group.

Preparation

We assume the data scientist has a dataset in tabular format ready to be shared. At the moment, only the following column types are supported:

  • categorical with at most 20 categories,
  • numerical,
  • text,
  • dates.

# Example dataframe: each inner list of `data` holds one column
import pandas as pd

data = [
  [1, 2, 4],
  ['Spain', 'United Kingdom', 'France'],
  ['Product 1001', 'Product 200', 'Product 800'],
  ['2019/01/01', '2010/06/02', '2020/11/10'],
  [1000, 2000, 9000],
  [800, 500, 8900]
]
columns = [
  'customer_id',
  'country',
  'product_name',
  'last_purchase',
  'spending',
  'predicted_spending'
]

# zip(*data) transposes the column lists into rows
df = pd.DataFrame(list(zip(*data)), columns=columns)

Example:

customer_id  country         product_name  last_purchase  spending  predicted_spending
1            Spain           Product 1001  2019/01/01     1000      800
2            United Kingdom  Product 200   2010/06/02     2000      500
4            France          Product 800   2020/11/10     9000      8900
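
Before publishing, it can help to check locally that every column you intend to treat as categorical stays within the 20-category limit listed above. The following is a minimal sketch using only pandas; the helper name too_many_categories is hypothetical and is not part of the DataQA client.

```python
import pandas as pd

MAX_CATEGORIES = 20  # limit stated above for categorical columns

# Same example rows as above, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "country": ["Spain", "United Kingdom", "France"],
    "product_name": ["Product 1001", "Product 200", "Product 800"],
    "last_purchase": ["2019/01/01", "2010/06/02", "2020/11/10"],
    "spending": [1000, 2000, 9000],
    "predicted_spending": [800, 500, 8900],
})


def too_many_categories(frame, categorical_columns, limit=MAX_CATEGORIES):
    """Return the columns whose number of distinct values exceeds the limit."""
    return [c for c in categorical_columns if frame[c].nunique() > limit]


# "country" has 3 distinct values, so nothing is reported.
offenders = too_many_categories(df, ["country"])
print(offenders)
```

A column reported by this check is probably better declared as text than as categorical.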

The data scientist is ready to evaluate the quality of her predictions. In the example above, predicted_spending holds the predictions for the column spending, which contains the actual ground-truth values.
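
To make the relationship between the two columns concrete, the sketch below computes the mean absolute error between spending and predicted_spending with plain pandas. It only illustrates what comparing predictions against ground truth means; which metrics the DataQA app itself computes is not specified here.

```python
import pandas as pd

# Ground-truth and predicted values from the example dataset above.
df = pd.DataFrame({
    "spending": [1000, 2000, 9000],
    "predicted_spending": [800, 500, 8900],
})

# Absolute difference per row, then the average over all rows.
abs_error = (df["spending"] - df["predicted_spending"]).abs()
mae = abs_error.mean()
print(mae)  # 600.0
```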

Creating a release

Whenever a new dataset gets published, a release is created inside a project. This means that an interface application is created where users can explore and perform operations on the dataset, such as filtering or grouping, and visualise different metrics.

Fastest

The fastest way to create a release is to use the Python client library. PROJECT_ID is the hash string that you can find on the project page (see Getting started).

from dataqa import DataQA

dataqa = DataQA()
dataqa.login()  # prompts for username and password

dataqa.publish(PROJECT_ID, df)

Running dataqa.login() prompts you to enter your username and password. dataqa.publish() then creates a release, associated with the previously created project, from the data in the variable df.

This command will print a link of the form https://app.dataqa.ai/release/RELEASE_ID where you can find a QA app with the data already loaded.

With this approach, the engine has to infer the type of each column being sent. More information is available in the schema inference documentation. The engine also assumes that no column holds the output of a machine learning model, so the app will not compute any machine learning metrics on this dataset.

Preferred

In the preferred approach, you define the schema of the data being published. This allows the app to visualise each column according to its type and, optionally, to compute machine learning metrics.

from dataqa import DataQA
from dataqa.column_mapping import ColumnMapping, PredictionColumn

numerical_columns = ["spending", "predicted_spending"]
categorical_columns = ["country"]
text_columns = ["customer_id", "product_name"]
time_columns = ["last_purchase"]

prediction_columns = [PredictionColumn(prediction_column="predicted_spending",
                                       ground_truth_column="spending",
                                       task="regression")]

column_mapping = ColumnMapping(numerical_columns=numerical_columns,
                               categorical_columns=categorical_columns,
                               text_columns=text_columns,
                               time_columns=time_columns,
                               prediction_columns=prediction_columns)

dataqa = DataQA()
dataqa.login()  # prompts for username and password

dataqa.publish(PROJECT_ID, df, column_mapping)

As above, this command prints a link of the form https://app.dataqa.ai/release/RELEASE_ID where you can find a QA app with the data already loaded.

The library has a utility function that performs the inference and returns a ColumnMapping object, which can be passed to the publish function after making any necessary modifications.

from dataqa.infer_schema import infer_schema

infer_schema(df)

In our case, it would return the following column mapping:

ColumnMapping(
  numerical_columns=['customer_id', 'spending', 'predicted_spending'],
  categorical_columns=['country', 'product_name', 'last_purchase'],
  text_columns=[],
  time_columns=[],
  prediction_columns=None
)
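
The inferred mapping above treats customer_id as numerical and both product_name and last_purchase as categorical, whereas the explicit mapping in the Preferred section declares them as text, text and time respectively. The sketch below shows one way to work out the corrected lists in plain Python. The dictionary and the move_column helper are hypothetical, used only for illustration; how a ColumnMapping object can be modified in place is not covered here, so the lists are rebuilt instead.

```python
# Column lists copied from the inferred mapping shown above.
inferred = {
    "numerical_columns": ["customer_id", "spending", "predicted_spending"],
    "categorical_columns": ["country", "product_name", "last_purchase"],
    "text_columns": [],
    "time_columns": [],
}


def move_column(mapping, column, source, target):
    """Return a copy of the mapping with `column` moved from `source` to `target`."""
    updated = {key: list(values) for key, values in mapping.items()}
    updated[source].remove(column)
    updated[target].append(column)
    return updated


# Identifiers and product names are free text; purchase dates are time values.
corrected = move_column(inferred, "customer_id", "numerical_columns", "text_columns")
corrected = move_column(corrected, "product_name", "categorical_columns", "text_columns")
corrected = move_column(corrected, "last_purchase", "categorical_columns", "time_columns")

for name, cols in corrected.items():
    print(name, cols)
```

The resulting lists match the ones used in the Preferred section and can be passed as keyword arguments to ColumnMapping, together with prediction_columns, as shown there.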