FAQ for Data Scientists
We have talked to many Data Science teams, and here is what they most often ask about.
There is industry-wide confusion about what “Feature Engineering” is in the first place.
Adding “automated” to it probably makes things even worse.
A short definition (which probably makes sense only for those who already know it) could be:
Feature Engineering is the process of building a Modeling Dataset
for Machine Learning.
For those who want more detail, it's best to answer in 4 steps. Read on.
That’s because Modeling Datasets consist almost exclusively of “Features”. Other names for Features used over the years include: variables, independent variables, input columns, factors (and probably more). The only column in a Modeling Dataset apart from the Features is the Target column - sometimes called the dependent variable.
Why “Engineering”? Because building Features is a quite complicated process and requires custom design work highly dependent on the source data. By “source data” we mean whatever data is available - typically in a relational database - BEFORE any work is done on the Predictive Analytics project. These databases come in various shapes and sizes (called data models), but what they have in common is that the data they contain is not directly digestible by Machine Learning. Those databases contain events and signals - a record-by-record history of what happened - and those records are not “Features”.
Someone has to transform - i.e. engineer - those records into “Features”; hence Feature Engineering.
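As a rough sketch of what that transformation looks like (the table and column names below are hypothetical, purely for illustration): raw events get aggregated into one row per entity.

```python
from collections import defaultdict

# Hypothetical source data: a record-by-record event history,
# one row per transaction - not yet "Features".
transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 25.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 2, "amount": 7.5},
    {"customer_id": 2, "amount": 12.5},
]

# Engineer Features: collapse the event history into one row per customer.
amounts = defaultdict(list)
for t in transactions:
    amounts[t["customer_id"]].append(t["amount"])

features = {
    cid: {
        "txn_count": len(vals),            # how many transactions
        "txn_total": sum(vals),            # total spend
        "txn_avg": sum(vals) / len(vals),  # average transaction size
    }
    for cid, vals in amounts.items()
}

print(features[2])
```

Each aggregated column (`txn_count`, `txn_total`, `txn_avg`) is one candidate Feature; in practice this gets repeated across many tables, time windows, and aggregate functions.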
By no means is there a consensus on the term Feature Engineering.
You can also find the following terms used interchangeably when discussing Predictive Analytics projects:
- ETL (Extract Transform Load)
- ELT (Extract Load Transform)
- Data Integration
- Data Munging
- Data Wrangling
… and more.
In the case of Xpanse AI, a better description would probably be “Autonomous” Feature Engineering.
It's a cutting-edge development in AI, where large parts of manual effort are replaced by clever tech.
Xpanse AI taps into relational databases and extracts Features autonomously - i.e. on its own.
It’s a multi-step process that works much the way humans used to do it - just on a massively larger scale and much faster.
A Modeling Dataset is a multi-column table organized in a way that Machine Learning algorithms can consume.
If you are in this section you probably have heard about the Iris dataset or the Titanic. Those are examples of Modeling Datasets.
Fun fact: Modeling Datasets don't exist ready-made in commercial environments. They have to be constructed separately for each Predictive Analytics project.
Many call the process of constructing Modeling Datasets “Feature Engineering”.
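As a toy illustration (the rows and values below are made up, in the spirit of the Titanic dataset), a Modeling Dataset is just a flat table of Feature columns plus one Target column:

```python
# A toy Modeling Dataset: one row per passenger, Feature columns
# plus a single Target column ("survived"). Values are invented.
modeling_dataset = [
    {"age": 22, "fare": 7.25,  "pclass": 3, "survived": 0},
    {"age": 38, "fare": 71.28, "pclass": 1, "survived": 1},
    {"age": 26, "fare": 7.92,  "pclass": 3, "survived": 1},
]

TARGET = "survived"
feature_names = [col for col in modeling_dataset[0] if col != TARGET]
print(feature_names)  # ['age', 'fare', 'pclass']
```

Everything a model learns from lives in the Feature columns; the Target column is what it learns to predict.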
Firstly - nothing stops you from combining the approaches. You can ask the machine to extract features for you and also add your own to the mix.
There are a couple of reasons why automation is beneficial:
- Speed: a computer can comb through thousands of potential features in minutes, compared to the weeks humans would need to vet only a few hundred.
- Depth of search: the machine can test ideas you would never even look at because of time constraints.
- Lack of bias: we come to work with our own ideas about what can be correlated with the target and what might not, hence we give more attention to certain directions while ignoring others. This may result in unwanted biases in the models. The machine is an equal-opportunity researcher.
Xpanse Insights is aimed at professionals primarily responsible for the business problem. If answers to improving your KPIs are hiding in the data - you should give us a call.
Xpanse AI is razor-focused on Predictive Analytics. It does require a bit of knowledge on your side about your data. A basic understanding of how to read the accuracy of Predictive Models would also be useful - but we can always help you with that during a short training session.
No, you don’t!
This is the major benefit of working with Xpanse AI.
We have clients from different industries loading up their databases with entirely different data models, and Xpanse AI lets them analyze them all within a few clicks.
The main difference is the speed of delivering the entirety of the Predictive Analytics project.
Xpanse AI can drive you from the beginning of the project to deployable models within a few hours.
How is this possible? It's best to answer by comparing Xpanse AI to several other tools.
H2o & Datarobot
H2o & Datarobot require a well-prepared Modeling Dataset as an input. Someone has to build it first.
Xpanse AI doesn't require a Modeling Dataset; its starting point is a relational database.
Trifacta, Alteryx & SAS
Those platforms are visual or scripting tools in which you create a data flow that converts your relational database into a Modeling Dataset. You do it by hand-crafting each aggregate (i.e. Feature). This takes an awful lot of time, is very repetitive, and limits the number of Features you can build and test. Xpanse AI executes this process autonomously, at a scale those tools can't match.
Featuretools requires a detailed understanding of the source data and decisions about which particular aggregates you want to build. Xpanse AI figures all of that out autonomously, saving you a lot of effort.
Plus, Featuretools focuses on only one step of the entire process, while Xpanse AI is an end-to-end Predictive Analytics platform.
This one again...
Missing Values are a problem sometimes encountered in Modeling Datasets.
However, the starting point for Xpanse AI is a database full of relational tables - not a Modeling Dataset.
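For context, the textbook fix on the Modeling-Dataset side is imputation - filling gaps with a statistic such as the column mean. The sketch below is purely illustrative (invented column and values) and says nothing about Xpanse AI's internals:

```python
# Hypothetical Feature column with gaps (None marks a Missing Value).
ages = [22, None, 38, None, 26]

# Mean imputation: replace each gap with the average of the known values.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
imputed = [a if a is not None else mean_age for a in ages]

print(imputed)
```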
We customize the price to meet your needs.
Some of our clients use our product once a week. Others use it every day.
Some users work with large databases that require substantial computing resources, while others prefer to carve out small chunks of data relevant to their projects.
We want to make sure that you get the most value for your money without overpaying, and that’s why we don’t have a one-size-fits-all pricing model.
Contact us at email@example.com