Machine Learning Applications: 55+ and counting

If you’ve ever wondered how machine learning applies to your industry, this is the guide for you.

In this guide, I explain 55 different real-world machine learning applications.

I cover 12 industries, and while you’ll see the obvious applications (Google, Facebook, Amazon), I’ve also sprinkled in a few non-obvious examples.

Within each example, you’ll get my in-depth analysis of the technology that powers it, including:

  • Who’s doing it
  • The type of data you would need
  • The algorithms probably used
  • The impact so far

Hopefully, it will help you understand more about the technology that currently powers our everyday lives, and spark some ideas on how to apply machine learning to your work.

Ready to get started?

Scroll down to read more (and if you’re in a rush, use the Table of Contents for reference to the specific industry you’re interested in).

This page is a growing list, so if there are others you come across, please email us, and we’ll add them in.

First and foremost: why do we care about business applications of machine learning?

Nobody starts from scratch. The smartest people tend to steal ideas. Of course, execution is always critical, but the goal of this page is to give you a starting point.

After doing this for the last 11 years, I see two major camps:

  • Data scientists who know all about machine learning and technology and can’t seem to make a business impact.
  • Executives who know the business but can’t seem to figure out how to bring ML / AI / DL to their organization.

Start with the table of contents, look at an industry, and see what is successfully applied.

From there, you can see what others do and, more importantly, how they bridge the gap from theory to practice.

Using Machine Learning to Improve Business and Marketing

When it comes to business and marketing, ROI (return on investment) is everything.

Every activity—including machine learning projects—needs to be positioned to help customers (i.e., businesses) grow, operate, and run their companies more efficiently.

The sky’s the limit when it comes to applying machine learning. Let’s break down a few common examples one by one.


Predicting Customer Lifetime Value (CLTV)

A diagram of how companies might think about customer churn.

Who’s doing it?

Just about every single SaaS (software as a service) company, such as ClickFunnels and Buffer.

Many companies make this calculation by summing a customer’s total purchases over time. The problem is that this doesn’t predict the value of a new incoming customer.

So, if you’re calculating CLTV, you probably want to utilize a predictive measure vs. a historical one.

What type of data do you need?

Start with data from your accounting department, and then work from there to build a model.

Some data points you need to calculate CLTV include:

  • Churn rate
  • Average monthly recurring revenue
  • Average order value
  • Average length of customer engagement
  • Customer acquisition cost

Algorithms used?

It depends on the company, but I like to start with a simple linear regression, either using log transforms or dummy variables.

If you need to get more advanced, you can do things like building conditional probability tables, attribute tables, and visualizations.
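
Here’s a minimal sketch of that starting point: a linear regression fit on log-transformed lifetime values. All numbers and feature choices below are made up for illustration, not any particular company’s model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one row per existing customer.
# Features: churn risk, avg monthly recurring revenue, avg order value,
# engagement length (months), acquisition cost.
X = np.array([
    [0.05, 120, 80, 24, 400],
    [0.20,  60, 40,  6, 250],
    [0.02, 200, 95, 36, 900],
    [0.10,  90, 55, 12, 300],
])
# Observed lifetime value for those customers (dollars).
y = np.array([6200, 900, 14500, 2700])

# CLTV is right-skewed, so fit the regression on log-transformed values.
model = LinearRegression().fit(X, np.log(y))

# Predict the value of a brand-new customer from their early signals.
new_customer = np.array([[0.08, 100, 60, 1, 350]])
predicted_cltv = np.exp(model.predict(new_customer))[0]
print(f"Predicted CLTV: ${predicted_cltv:,.0f}")
```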

What’s the impact on the industry?

If a company can determine its customers’ lifetime value, it has a functional basis for tweaking and operating the business.

For example, if a customer is worth $5,000 and it costs the company $1,000 to acquire the sale, the gross profit is $4,000. Knowing this, a company can calculate more accurately how much to spend on advertising, identify its most loyal customers, and, most importantly, decide how to prioritize its resources.


Emotion / Sentiment / Tone Analysis

Grammarly’s email tone detector on an email I composed.

Who’s doing it?

Many SaaS apps analyze things for emotion and sentiment, but most recently, Grammarly analyzed my emails for tone (see above).

What type of data do you need?

The data needed is text: documents, lots of words.

Grammarly analyzes text for writers on a minute-by-minute basis, so they most likely take the document and examine it for a few core emotions: happy, sad, angry, positive, neutral, etc.

Which algorithm?

When I was building sentiment engines in 2012, we would use support vector machines. Nowadays, I like to use (and am pretty sure Grammarly is using) bidirectional long short-term memory networks, otherwise known as BiLSTMs.

Before you go out and build this yourself, however, check out the APIs on IBM Watson. They might save you a few hours of work.
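
If you did want to roll your own, here’s a minimal sketch of a bidirectional LSTM tone classifier in Keras. The vocabulary size, tone labels, and training data are placeholders, not Grammarly’s actual setup.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_TONES = 10_000, 200, 5  # e.g., happy/sad/angry/positive/neutral

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),            # token IDs -> dense vectors
    layers.Bidirectional(layers.LSTM(64)),       # read the sequence in both directions
    layers.Dense(NUM_TONES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy stand-ins for tokenized, padded documents and their tone labels.
X = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, NUM_TONES, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
```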

What’s the impact on the industry?

For Grammarly, the impact for writers is substantial because now they know what to change in their writing based on visual cues (in this case, an emoji).

But sentiment analysis can be used in other ways. So, for example, if you were a brand like Ford or Patagonia, you could quickly run a sentiment analysis on all of your social media comments to see how well your new products or marketing campaigns are received and how to course correct from there.


A/B Split Testing

Bar chart depicting how A/B testing works.

Who’s doing it?

Optimizely popularized this idea, famously helping the Obama 2012 campaign test which hero image converted better on its donation page.

What type of data do you need?

To get something statistically significant, you need a large sample of users split across two different versions of something, so you can see whether one converts better than the other.

Which algorithm?

Using a chi-squared test, you can figure out whether the difference between version A and version B is real. So, if A gets 100,000 clicks and B gets only 10,000 clicks, then A is better than B.
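
In practice, you’d compare conversions against non-conversions for each variant. A quick sketch with SciPy (the visitor counts below are made up):

```python
from scipy.stats import chi2_contingency

# Contingency table: [converted, did not convert] for each variant.
observed = [
    [1200, 48800],   # version A: 50,000 visitors
    [1000, 49000],   # version B: 50,000 visitors
]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
# A small p-value (e.g., < 0.05) suggests the conversion rates genuinely differ.
```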

What’s the impact on the industry?

Once Optimizely showed what was possible with A/B tests, organizations started experimenting with every decision. Instead of guessing what will convert better on a page, someone can collect data from their users to figure out what the “winner” is.


Email Personalization

A personalized thank you email from Lady Antebellum (via Spotify).

Who’s doing it?

Most marketing automation platforms like SailThru, ExactTarget, and Iterable offer this as part of their technology.

What type of data do you need?

Information on what your users are opening, clicking, and doing as a result of receiving your message – things like open rates, clickthrough rates, and unsubscribe rates.

Which algorithm?

I can only speak for SailThru, but basically, they collect data on whoever opens and clicks on a message, then fit a classifier on it and calculate a probability score.

I guess that they’re using linear regression, but it could also be random forests or decision trees. Regardless, it doesn’t necessarily matter, because it’s a supervised learning problem: trying to figure out what’s going to happen in the future based on events that occurred in the past.
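
As a sketch of the general idea (not SailThru’s actual model), here’s a logistic regression that turns past engagement into an open-probability score; the features and numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-recipient features: past open rate, past click rate,
# days since last open, messages received this week.
X = np.array([
    [0.60, 0.20,  1, 2],
    [0.05, 0.00, 45, 7],
    [0.35, 0.10,  3, 1],
    [0.10, 0.02, 30, 5],
    [0.80, 0.40,  0, 3],
    [0.02, 0.00, 60, 9],
])
y = np.array([1, 0, 1, 0, 1, 0])   # did they open the last campaign?

clf = LogisticRegression().fit(X, y)

# Probability score for a new recipient: who should get this send?
print(clf.predict_proba([[0.50, 0.15, 2, 2]])[0, 1])
```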

What’s the impact on the industry?

Personalization is now a standard for any messaging a brand uses to engage with its customers. Each contact costs companies money, so knowing what content to send to whom is just smart business practice.


Summarizing Legalese

How legalese gets detected (and corrected) with software.

Who’s doing it?

LegalRobot.

What type of data do you need?

Since they’re summarizing legalese, the data you need is legal documents. In other words, lots of text, as well as lawyers for legal expertise. Remember, you can’t train a computer to do something without a human as a model.

Which algorithm?

I don’t know for sure, but I’d guess they used topic modeling (LDA or LSA), TF-IDF weighting, or newer embedding techniques like word2vec, GloVe, or ELMo.

Whatever it is, each of these models can be used to derive semantic meaning out of the text.
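
As a toy illustration of deriving similarity from text, here’s TF-IDF with cosine similarity in scikit-learn. The clauses are made up; a real system would train on full contracts with legal expertise in the loop.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for legal clauses.
clauses = [
    "The lessee shall indemnify and hold harmless the lessor from all claims.",
    "Tenant agrees to protect landlord against any and all liability.",
    "Payment is due within thirty days of the invoice date.",
]
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(clauses)

# Clauses with similar wording score close to 1.0; unrelated ones near 0.
print(cosine_similarity(vectors[0], vectors[1]))  # two indemnification clauses
print(cosine_similarity(vectors[0], vectors[2]))  # indemnification vs. payment
```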

What’s the impact on the industry?

Legalese is practically another language, and hiring a lawyer for $400/hr isn’t always realistic. So technology can help bring down the cost of reviewing a legal document and empower non-lawyers to feel comfortable negotiating their own legal issues.


Entertainment, Music, and Movies

Before this technology was available, we had to painstakingly choose TV shows, movies, and music ourselves.

Now, machine learning helps us choose.

Entertainment companies mostly deliver recommendations tailored to a user’s past behavior, “learning” a pattern and then deciphering what else that user might like.


Playlist generation

A radio playlist generated by Spotify based on a user’s listening habits.

Who’s doing it?

Spotify

What type of data do you need?

To do this, you need data from an individual user. You will have to record and track the user’s search history and clicks.

Which algorithm?

I guess that Spotify is using an ontological approach that searches based on metadata as well as listening patterns. If you were to search for an artist, that artist would be assigned to a genre, and Spotify would deliver additional new music to you in the same or adjacent styles.

Additionally, Spotify probably finds other music that’s highly rated by similar individuals, in the spirit of collaborative filtering.

What’s the impact on the industry?

For a music lover, this is great. It offers new music to listen to that is similar enough to the music you like. You get the benefit of listening to what you want, as well as discovering things you might like.

Exploration vs. exploitation is an integral tradeoff in machine learning. The algorithm wants you to keep listening but also needs to deliver things that keep the playlist from being the same thing over and over (although I won’t judge if someone listens to the same song 100 times in a row).


Movie Recommendations

The movies Netflix recommends based on someone’s viewing of Mad Max: Fury Road.

Who’s doing it?

Many are doing this today, including Amazon, Hulu, and Netflix. Netflix popularized this approach, cementing it with the Netflix Prize.

What type of data do you need?

Making movie recommendations has two critical aspects: what users are watching and what they want to watch.

Which algorithm?

The algorithm used by BellKor’s Pragmatic Chaos to win the Netflix Prize was a variant of stochastic gradient descent (the BellKor algorithm). Keep in mind that it’s been ten years since the Netflix Prize ended; today, someone would most likely pull out Keras, PyTorch, or another deep learning library to yield something more substantial, with Adam as a better optimization algorithm.
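
For flavor, here’s a bare-bones sketch of the matrix factorization idea behind many Netflix Prize entries, trained with stochastic gradient descent on a made-up ratings matrix:

```python
import numpy as np

# Tiny ratings matrix (users x movies); 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors

lr, reg = 0.01, 0.02
for epoch in range(200):
    for u, i in zip(*R.nonzero()):
        err = R[u, i] - P[u] @ Q[i]             # error on a known rating
        P[u] += lr * (err * Q[i] - reg * P[u])  # SGD updates with L2 regularization
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 1))  # reconstructed matrix; the zeros are now predictions
```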

What’s the impact on the industry?

Now companies can figure out what users are willing to watch and what they aren’t. More engagement equals less user churn.


Cognitive Movie Trailer

How movies like Suicide Squad, Zootopia, and Fifty Shades of Grey get dissected for analysis. Image credit: arXiv

Who’s doing it?

20th Century Fox in partnership with IBM Watson

What type of data do you need?

Training data. In this case, scientists and researchers used 100 different horror movies and segmented out scenes in each of the movie trailers.

Which algorithm?

It appears that they scored each scene in terms of audio, visual, and general tone and sentiment. Audio can be classified into emotional tones given ground-truth labels, and the same goes for the visuals. The general mood, I believe, was derived from the natural language in the scenes. From that, they looked for the scenes that best captured what they wanted in a trailer.

There was still some human intervention involved, although the result looks good!

What’s the impact on the industry?

Turning weeks of work into a mere few days. According to IBM Watson . . .

Traditionally, creating a movie trailer is a labor-intensive, completely manual process. Teams have to sort through hours of footage and manually select every potential candidate moment. This process is expensive and time-consuming, taking anywhere between 10 and 30 days to complete. From a 90-minute movie, our system provided our filmmaker a total of six minutes of footage. From the moment the system watched “Morgan” for the first time to the moment our filmmaker finished the final editing, the entire process took about 24 hours.

In-store and Online Retail

The goal of any retail or e-commerce business is to get customers to buy more.

Similar to entertainment companies, retail and e-commerce companies are interested in recommending items to customers similar to what they previously purchased.


E-commerce Recommendation Engines

Looking for a toilet seat? Amazon’s recommendation engine has you covered (Pun intended).

Who’s doing it?

Most famously, Amazon.

What type of data do you need?

I don’t know for sure, but my guess is they’re using browsing session data. They probably believe that if someone clicks on a toilet seat, they’ll click on another one to comparison shop.

Which algorithm?

Collaborative filtering is the canonical example for recommendation engines, although there is also the universal recommender, which yields excellent results as well.
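
Here’s a tiny sketch of the item-to-item flavor of collaborative filtering that Amazon popularized, using nothing but co-occurrence in (made-up) browsing sessions:

```python
import numpy as np

# Rows = shopping sessions, columns = products (1 = viewed/bought).
sessions = np.array([
    [1, 1, 0, 0],   # toilet seat A + toilet seat B
    [1, 1, 1, 0],   # both seats + wax ring
    [0, 0, 1, 1],   # wax ring + plunger
    [1, 0, 1, 0],
], dtype=float)

# Item-item cosine similarity: items co-occurring in sessions score high.
norms = np.linalg.norm(sessions, axis=0)
similarity = (sessions.T @ sessions) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0)   # don't recommend an item to itself

# "Customers who viewed item 0 also viewed..." -> rank the other items.
print(similarity[0].argsort()[::-1])
```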

What’s the impact on the industry?

It’s simple: when customers see more, they buy more, especially if it’s related. A lot of the time, when we buy a toilet, we want a toilet seat to go with it. It’s that simple, really: more views equals more money.

Sidenote: I can’t share the details of how I did it, but I finally received my patent for the recommendation engine I built at Ritani.com.


Online Personal Shopping

StitchFix finetunes their algorithm by asking users to rate whether items in inventory are “their style.”

Who’s doing it?

StitchFix

What type of data do you need?

When StitchFix first launched, they asked users to fill out a profile asking what fabrics, colors, and price points they liked.

But nailing style is a tricky thing to do, let alone online. So StitchFix continually offers quizzes to the user to make sure personal stylists are sending the right items to customers.

Which algorithm?

They are most likely using some riff on collaborative filtering, like the universal recommender. If someone bought an item and liked it, that would get a +1; across many customers and items, those signals turn into a matrix of preferences. From there, you could say that someone who likes Barney’s shirts might also like Comme des Garçons shirts.

What’s the impact on the industry?

Well, StitchFix IPO’ed in 2017, so they’ve gotten the attention of investors by combining fashion with algorithms. They’re continually expanding their machine learning capabilities. Still, it remains to be seen if they can compete with e-commerce giants like Nordstrom and Amazon, who also offer online personal shopping services.


Fashion Trend Prediction

How fashion photography can be dissected and analyzed using machine learning.

Who’s doing it?

Heuritech, a startup based out of France, is currently doing this.

What type of data do you need?

Lots of photos, and probably some validation data (number of likes/comments) on social media.

Which algorithm?

Most likely, they’re using a convolutional neural net to extract features from photos and aggregate them into clusters. My assumption is they used their own training set vs. something general, such as Amazon’s Rekognition.

What’s the impact on the industry?

It’s a significant win for fashion houses Louis Vuitton and Christian Dior, who are Heuritech’s new customers. According to their CEO,

“By capturing and analyzing these new voices in real-time, our AI technology allows brands to predict fashion trends months before they happen, adapt their designs, and better plan their merchandise mix.”

Pretty significant impact.


Pregnancy Prediction

Thanks to machine learning, Target discovered newly pregnant women usually purchase unscented lotion.

Who’s doing it?

Most famously, Target. They knew that companies used birth records to target new parents but wanted a way to predict pregnancy to target customers, which resulted in their sending baby coupons to a teenager in Minnesota.

What type of data do you need?

They analyzed the types of products customers purchased during specific points of pregnancy, using their GuestID and baby registry information.

Which algorithm?

Most likely, the result is more of an artifact. I suspect that an algorithm detected that a woman was buying pregnancy tests, or had stopped buying her usual amount of feminine hygiene products (i.e., missed a menstruation cycle), and the algorithm said, “Oh yes, this person is pregnant.”

The funny thing is that it might only be an 80% probability (meaning 1 in 5 is wrong), and maybe someone isn’t pregnant. But those who aren’t pregnant would probably ignore the mailing. If I were to pick an algorithm, it’d be some variant of XGBoost or CatBoost, or perhaps an LSTM.

What’s the impact on the industry?

It set the standard for identifying critical points in a customer lifecycle and marketing to customers accordingly. But companies do have to be careful: no customer wants to feel like they’re being spied on, so the offers have to “look random” to some extent.

Internet of Things and Smart Devices

IoT is a hot topic, and its combination with machine learning is a match made in heaven.


Smart Thermostats

The Nest Thermostat learns user patterns to optimally heat (or not) your house or apartment.

Who’s doing it?

Most famously, Nest, but now there are similar devices made by Ecobee and Honeywell.

What type of data do you need?

The data comes from users turning the knob: every manual adjustment is a training signal. Over time, a smart thermostat learns when you like it to be colder or warmer, based on the time of day.

Which algorithm?

Most likely, they are utilizing some seasonal model that captures dips and valleys. STL (Seasonal and Trend decomposition using Loess) is a standard technique: it splits a time series into seasonal, trend, and remainder components, with Loess itself being a local regression method.

You could also achieve this in other ways with Q-learning, based on the time of day and day of the week. There are a lot of ways of doing this, like using ARIMA.
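
Here’s a quick sketch of STL in statsmodels on simulated thermostat data; the daily temperature pattern below is synthetic, not real Nest data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Simulated hourly indoor-temperature setpoints over two weeks:
# a daily cycle (cooler at night) plus noise.
hours = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
temps = (20 + 2 * np.sin(2 * np.pi * np.arange(len(hours)) / 24)
         + np.random.default_rng(0).normal(0, 0.3, len(hours)))
series = pd.Series(temps, index=hours)

# STL splits the series into trend, a repeating daily season, and residual.
result = STL(series, period=24).fit()
print(result.seasonal.head(24))  # the learned daily heating/cooling pattern
```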

What’s the impact on the industry?

I can’t speak for others, but when I lived in an older home, Nest made me feel much more at ease when I received my oil bill (especially when I left for work and forgot to turn it off). Now homes can be heated more efficiently based on a user’s pattern.


Voice-activated personal assistants

Thanks to voice activation technology, more people can afford “personal assistants.”

Who’s doing it?

Amazon’s Alexa, Microsoft’s Cortana, and Apple’s Siri are three of the major ones we use every day.

What type of data do you need?

The training set is audio (a WAV file or MP3) plus a transcription, using something like SRT timestamps to sync up words with the audio.

Which algorithm?

Most voice detection tools utilize recurrent neural networks, with a ton of training data, as well as some pre-processing tricks. I have heard of using ICA (Independent Component Analysis) or Fast Fourier Transformations to get components out of a sound.

Sound is incredibly complicated because there’s so much happening at once. Microphones can detect that someone is playing an instrument, talking, and doing the laundry all at the same time. Separating out just the person speaking is difficult. And once you have done that, running the result through an RNN to detect what someone is saying (let alone what language they are using) is also tricky.

What’s the impact on the industry?

I don’t use voice-activated personal assistants (another story for another day), but my in-laws love that they can ask for what they want without having to push a button. For most people, it made it easier to buy goods, listen to the music they want, and set a timer while cooking. Who knows what else you can do with them in the future?


Auto-Driving Automobiles

A visualization of how self-driving cars know what to plow past (and what to avoid hitting).

Who’s doing it?

Tesla, of course, and Google.

What type of data do you need?

The data collected here is semi-supervised. In the beginning, someone most likely bootstrapped this problem by attaching a camera to the dash of a car to collect footage and then ran some computer vision models on it to see if they could separate what is what.

These computer vision models were most likely transferred from something else. There are datasets like COCO from Microsoft, which label object categories; people use them to train up a model and then apply it in new spaces.

Which algorithm?

Neural nets take the camera feed and detect objects: convolutional nets handle the per-frame detection, while recurrent nets (RNNs) help make decisions over the resulting sequence.

What’s the impact on the industry?

From a pure human capital standpoint, self-driving cars would reduce the cost of getting goods from one place to another and (most likely) reduce the number of accidents that happen from human error.

So while the vision is there, so far we only have Autopilot mode on Teslas, which lets my brother-in-law change lanes automatically on a Southern California highway.


Smart Energy Meters

Sense records your home energy use so you can (ideally) optimize your energy bills.

Who’s doing it?

Sense.

What type of data do you need?

A time-series ammeter connected to your circuit breaker is needed to sense changes in electricity usage. From there, Sense runs Fourier transforms on the time series of current to pull component frequencies out of the oscillations. Fourier transforms are the same math that powers MP3s and MP4s: they work well for lossy compression, but they also help determine a signal’s component frequencies.
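
A minimal sketch of that idea with NumPy: pull the component frequencies out of a simulated current signal. The 60 Hz and 120 Hz components below are stand-ins for real device signatures.

```python
import numpy as np

fs = 1000                      # samples per second from the ammeter
t = np.arange(0, 1, 1 / fs)
# Simulated current draw: 60 Hz mains plus a 120 Hz appliance signature.
current = 1.5 * np.sin(2 * np.pi * 60 * t) + 0.4 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(current))
freqs = np.fft.rfftfreq(len(current), 1 / fs)

# The dominant component frequencies act as a device "fingerprint".
peaks = freqs[spectrum.argsort()[::-1][:2]]
print(sorted(peaks))  # -> [60.0, 120.0]
```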

Which algorithm?

Analyzing energy is most likely done through some classification problem that takes frequency-domain features (component frequencies, amplitudes, and other parts of the signal) and maps them onto an existing database of device signatures. This analysis could also be a straight search that returns the best match over a threshold.

What’s the impact on the industry?

Saving energy is never a bad thing, especially if you know which devices are sapping energy (but aren’t being used).


Self-driving Vacuums

How a Roomba might orient itself to vacuum a room.

Who’s doing it?

Roomba.

What type of data do you need?

Most likely, the new Roomba runs some multi-agent learning model that explores the room initially to get a sense of its boundaries (which could be any of many algorithms, but I like collaborative diffusion as a process).

Once the Roomba learns a map, it can start to mop and sweep the floor while steering clear of obstacles.

Which algorithm?

The new 980 model utilizes reinforcement learning.

What’s the impact on the industry?

Well, when Roombas first came out, they were pretty stupid and ran into a lot of things (including dog poop). However, people don’t like to vacuum, so the technology will most likely keep getting better as long as people want it.

Cloud, APIs, and Software

Machine learning applications in this section are services that weren’t possible before the Internet but are now a standard in how we query information about a product or service.


Automated Translation Service

Google Translate can take French words and turn them into Icelandic (or any other combination of languages you could ever want).

Who’s doing it?

Google Translate.

What type of data do you need?

Text. Lots of it.

Which algorithm?

Old-school translation models utilized heuristics, and many hours were needed to fine-tune them. Now, deep learning can solve these sequence-to-sequence problems using RNNs and LSTMs.

What’s the impact on the industry?

Now you don’t need bilingual individuals to translate from one language to another. The technology isn’t perfect, but it can get most of us pretty far (especially if we’re traveling on vacation).


Customer Success Bots

How meal delivery service SunBasket uses bots to automate customer service.

Who’s doing it?

Many companies are rolling out customer success bots, such as Comcast and AT&T. Before they route you to an actual human being, you’re most likely talking to a computer that routes your phone call or email inquiry.

So, for instance, when you say to Comcast, “How do I pay my bill?” you get directed to a page, not to a customer service person.

What type of data do you need?

The data used to make this happen is most likely some dialogue dataset, i.e., conversations between two individuals.

The dialogue could be specific to an industry or chats from support agents in the past.

Which algorithm?

There is a whole host of algorithms in the NLU (Natural Language Understanding) and NLP (Natural Language Processing) fields. Effectively, under the hood, the algorithms utilize neural nets, HMMs, and topic models like LDA or LSA to determine what people are most likely saying and what they want. The bread and butter of NLP and NLU is entity extraction (also called named entity recognition), along with POS (part of speech) tagging.
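
Here’s a quick sketch of those bread-and-butter pieces using spaCy; any NLP library with entity and POS support would do:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("How do I pay my Comcast bill for March?")

# Entity extraction: pull out the things the customer is talking about.
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g., Comcast ORG, March DATE

# POS tagging: verbs like "pay" hint at the intent behind the message.
for token in doc:
    print(token.text, token.pos_)
```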

What’s the impact on the industry?

Pretty significant. It costs a lot to run a call center. For years, companies have been outsourcing to places like India and the Philippines, although even that is still pricey.

So to reduce costs, labor has to get cheaper. By letting robots do some of the work, it can free up resources for humans to do the hard, non-automated stuff.

Last year, I heard that Amazon is trying to predict implicit intent, i.e., trying to predict what you want before you ask for it explicitly. So, if you asked your friend whether he or she is going to hang out later, then based on your history of hailing Ubers to see that friend, the bot could determine that you need a car and order one without you having to ask.

It could be creepy or revolutionary. Only time will tell.


Google Autocomplete

According to Google Autocomplete, a lot of people have questions about the famous celebrity “Larry.”

Who’s doing it?

Most famously, Google. Type in “How to,” and you’ll see a series of suggestions or auto completions.

What type of data do you need?

You need training data from user searches, collected over time.

Which algorithm?

I’m not sure exactly how Google does this, but most likely, they are using some deterministic finite automaton (DFA). DFAs are the same machinery that powers regular expressions.

DFAs are very similar to decision trees. When the user types “how,” they land in the “how” branch of the DFA or tree, and from there, there are many possible outcomes. An answer or auto-completion is a terminal node. But which leaf to respond with, and in what order? That is a problem for machine learning: for each node, you can count occurrences over an exponentially decayed time window and rank candidates that way.
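
Here’s a toy sketch of that idea: a trie (a close cousin of a DFA) whose completions are ranked by exponentially decayed counts. All class and function names are hypothetical, not Google’s implementation.

```python
import math
import time

class AutocompleteNode:
    def __init__(self):
        self.children = {}
        self.completions = {}   # full query -> (count, last_seen_timestamp)

def add_query(root, query, now):
    node = root
    for ch in query:
        node = node.children.setdefault(ch, AutocompleteNode())
        count, _ = node.completions.get(query, (0, now))
        node.completions[query] = (count + 1, now)

def suggest(root, prefix, now, decay=86_400, k=3):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    # Rank by count decayed exponentially with age, so fresh queries win.
    def score(item):
        count, last_seen = item[1]
        return count * math.exp(-(now - last_seen) / decay)
    return [q for q, _ in sorted(node.completions.items(), key=score, reverse=True)[:k]]

root, now = AutocompleteNode(), time.time()
for q in ["how to tie a tie", "how to tie a tie", "how to train a dragon"]:
    add_query(root, q, now)
print(suggest(root, "how to t", now))
```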

What’s the impact on the industry?

It’s a gold standard now for Amazon and other e-commerce sites; plus, it’s easy to do thanks to Elastic’s implementation.


Semantic Similarity Engines

Semantic similarity is simply understanding how words and terms connect together.

Who’s doing it?

Some companies, like Airbnb, have utilized this for years, although there are more recent adopters like Anghami.

What type of data do you need?

You need words as inputs to train on. Then, if you’re looking for a bungalow on Airbnb to stay in for your next vacation, Airbnb can recommend a rambler to you as well.

Which algorithm?

Using a technique called word2vec (a form of representation learning), you can take an extensive vocabulary and learn a vectorization of it. That vectorization can then be utilized to find similar words very quickly.

word2vec is incredible for simple calculations like Queen = King – Man + Woman. You can do geometry on words!
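
You can try this yourself with pre-trained vectors via gensim; the example below downloads public GloVe vectors on first run, and any embedding model with a similar API would work:

```python
import gensim.downloader as api

# Pre-trained 50-dimensional GloVe word vectors (downloaded on first run).
vectors = api.load("glove-wiki-gigaword-50")

# "Geometry on words": king - man + woman lands near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same trick powers semantic search: find terms close to a listing query.
print(vectors.most_similar("bungalow", topn=5))
```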

What’s the impact on the industry?

It adds another level of dimension to the usual recommendation engine.

Hotels, Airlines, and Travel

Before the Internet, you would use a travel agent to book a trip.

Now? You can go on the Internet and comparison shop for a plane ride, hotel, and rental car with a few clicks of a button, thanks to the power of machine learning.

The demand for cheaper travel is always there, so it’s in the travel industry’s interest to experiment with machine learning to give customers what they want while making sure that they maximize their efficiencies as much as possible.


Dynamic Pricing

Hotels like Starwood use dynamic pricing to figure out when to raise or lower room rates.

Who’s doing it?

Airbnb and Starwood Hotels apply dynamic pricing.

What type of data do you need?

At the bare minimum, you would need past data on bookings. Airbnb probably also uses geolocation as a factor, since properties in NYC are considerably more expensive than in Kansas City.

For Starwood, they specifically use the following:

  • past and present reservation data
  • booking pattern data
  • cancellation and occupancy data
  • room type
  • daily rate data
  • transient or group status (whether you’re a solo traveler blowing into Manhattan for a couple of days or a gang of 100 coming in for a convention)
  • external data, such as competitive pricing, weather, and climate data and booking patterns on other sites

Which algorithm?

Most likely, they are using some seasonal and time-series-based regression. It could be generalized autoregressive conditional heteroskedasticity (GARCH) or one of the modern ensemble methods.

Regardless, they want to predict with high enough accuracy (i.e., low enough error) that demand will be within a given constraint.

What’s the impact on the industry?

If you’re renting out property on Airbnb, it’s a positive impact! Instead of randomly guessing, you have some data and a technology stack that helps you maximize the rate you set for your room or house rental.

For Starwood, it’s probably helped them maximize their profits. But for a user, it means that you can refresh the page and magically see a higher price than you did 5 minutes ago, which can be slightly annoying.


Predicting Lost Connections in Airlines

If airlines can predict missed connections, they can reroute planes more efficiently (and potentially avoid a mob of angry travelers).

Who’s doing it?

Airlines like Delta, American, and others.

What type of data do you need?

According to this presentation, the machine learning model utilized:

  • Source data from the US Department of Transportation on flights and flight delays
  • Equipment information
  • Flight schedules
  • Total flight time
  • Airport locations
  • Flight date (month, day of the month, day of the week)
  • Daily weather data

Which algorithm?

Bagging was found to work reasonably well, although I would suspect that boosting would also work, since it, too, is an ensemble method.

Overall, I believe this supervised learning problem could be solved using XGBoost or CatBoost, which are incredible packages.
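
Here’s a sketch of how such a supervised model might look with scikit-learn’s gradient boosting; the features mirror the list above, but the data below is simulated, not real DOT data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features per itinerary: inbound delay (min), scheduled
# connection time (min), day of week, and a bad-weather flag.
X = np.column_stack([
    rng.exponential(15, n),       # inbound delay
    rng.uniform(30, 180, n),      # scheduled connection time
    rng.integers(0, 7, n),        # day of week
    rng.integers(0, 2, n),        # bad weather (0/1)
])
# Toy label: the connection is missed when delay eats the buffer.
y = (X[:, 0] + 20 * X[:, 3] > X[:, 1] - 30).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```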

What’s the impact on the industry?

Airlines are responsible for rerouting passengers if they miss a connection, which can cost the airline millions of dollars a year.

Having the ability to predict missed connections means that the airline can adjust in real-time to these issues, either by provisioning an extra plane, leaving early, or leaving later.


Traffic patterns and map directions

Thanks to Waze, we have a better picture of when to start driving and what roads to avoid.

Who’s doing it?

Waze is a mobile app that updates traffic patterns in real-time. Users get the fastest route every time they input directions to a location.

What type of data do you need?

Finding the shortest path through a graph is a problem Dijkstra solved long ago. The catch is that we don’t know the weights between nodes in the graph.

If you have a graph of all possible decision points on a trip (nodes), you could quickly figure out the shortest path between A and B, given the cost of travel (usually time, or perhaps some agony measure). The incredible thing that Waze does is relearn the weight of the edge between two nodes.

For example, the stretch between 145th and 175th should take 5 minutes at one time of day, but given multiple agents (people with phones in their cars), we can update that to reflect current traffic.

The critical thing about Waze is crowdsourced learning. They use multiple actors as training inputs instead of just hiring someone to observe. Phones collect data on drivers in real-time and rejigger the predictions.

Which algorithm?

Waze is solving a significant problem in graph theory called the shortest path problem. There are many algorithms for solving this, like A* search, Dijkstra’s algorithm, or my favorite, the Viterbi algorithm.

That isn’t really where the data comes into play, though. As cars are driving, how do you update the “cost” of vertices (or paths on a map)?

I believe they are incrementally learning the cost between nodes using some temporal difference learning. This incremental learning is similar to exponential smoothing.

For example, the time between A and B might take 30 minutes for Car One but 15 minutes for Car Two. The reason Car One took 30 minutes might range from the passenger stopping to get coffee or just the car driving slower. Also, it might be that Car Two did not leave during rush hour.

Exponential smoothing takes care of all of this averaging, while also weighting recent observations more heavily.
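
Here’s the core of that update in a few lines; the smoothing factor and observations are made up:

```python
# Exponentially smoothed travel-time estimate for one road segment.
ALPHA = 0.3   # higher alpha = weight recent observations more heavily

def update_edge_cost(current_estimate, observed_minutes):
    """Blend a new drive time into the running estimate for this edge."""
    return ALPHA * observed_minutes + (1 - ALPHA) * current_estimate

estimate = 5.0                     # prior: segment takes ~5 minutes
for observed in [30, 15, 6, 5]:   # Car One (coffee stop), Car Two, then free flow
    estimate = update_edge_cost(estimate, observed)
    print(f"observed {observed:>2} min -> estimate {estimate:.1f} min")
```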

What’s the impact on the industry?

Before Waze, Google Maps worked, but it never took traffic patterns into account, and the shortest path distance-wise may not be the quickest way to get somewhere. So if you’re heading home from work and there’s a massive accident that has yet to be cleared up on your typical route home, you can reroute another way (which saves you time, money, etc.).

Financial Services

I started my data science career as a financial quant, so I’m always amazed at how far the machine learning applications have come in the past decade.

Software development in finance has a reputation for being a tad stodgy, but there are a few companies doing awesome things with machine learning.


Hedge Fund Portfolio Management

A visualization of how financial portfolios could be analyzed.

Who’s doing it?

Renaissance Tech

What type of data do you need?

I wish I knew…Renaissance is famously tight-lipped about their methods.

Based on interviews I have seen in the past, most likely they utilize some HFT (high-frequency trading) algorithms that rely on very short time intervals to make rapid judgments. This kind of goes against the efficient market hypothesis (EMH)…although they make tons of money.

What’s interesting is how physically close they are to the stock market. Renaissance gets data super fast because they are almost right next door: the latency in the network is so short that they can act quicker than their competitors. Fundamentally, though, they use all the same data that other companies use (fundamentals, prices, returns, SEC filings, etc.). They can’t use insider information, due to insider-trading laws.

Which algorithm?

Well, in general, portfolio managers are trying to be as quick on their feet as possible. So they consider a lot of ARIMA or GARCH models while also utilizing some HMMs.

Jim Simons, who started Renaissance Technologies, studied hidden Markov models, so they probably use those. Nobody knows for sure.

What’s the impact on the industry?

If you can maximize returns while minimizing volatility, you make more money. A pretty significant impact if you ask me!


ATM Check Deposit Verification

Computers can analyze a check’s numbers, which helps when depositing in an ATM.

Who’s doing it?

I know that Chase, Wells Fargo, and Bank of America do this, and there are probably many others.

What type of data do you need?

You need data on handwriting for numbers and text.

Which algorithm?

Most likely, they are using some neural net to detect the check numbers. MNIST is a famous dataset of handwritten digits, so this is largely a solved problem. There are a ton of ways to do this, but ANNs tend to work very well and have for years.
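
Since MNIST is public, you can reproduce the core of this in a few lines of Keras; this is a toy digit classifier, not a bank’s production system:

```python
from tensorflow import keras

# MNIST: 70,000 labeled 28x28 grayscale images of handwritten digits
# (downloaded automatically on first run).
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # one class per digit
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, verbose=0)
print(f"Test accuracy: {model.evaluate(x_test, y_test, verbose=0)[1]:.2%}")
```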

What’s the impact on the industry?

My guess is that it helps speed up the process of verifying a check, because big banks employ a ton of people to validate paper checks after deposit.

A human still has to verify the check, though, so I’m not entirely sure how much time checking the amount with technology first actually saves the bank.


Driving pattern monitoring

How a user’s driving patterns get analyzed for insurance purposes.

Who’s doing it?

Car insurance companies like Metromile, Safeco, and others monitor driving patterns.

What type of data do you need?

By hooking into the OBD-II adapter for some period of time (say, 90 days), insurance companies can determine the number of times a driver slams on the brakes or rapidly accelerates.

Which algorithm?

Using the data from the OBD-II adapter, they most likely determine from their models (probably using simple regression techniques) what your probability of an accident is.

What’s the impact on the industry?

For someone like me who works from home and doesn’t drive a lot, insurance can be relatively expensive. Having this monitoring is incredible for determining a better insurance rate so that I can save money.

Monitoring driving patterns also benefits insurance companies: They can make their rates competitive without sacrificing their bottom line.


Detecting Money Laundering

Deep learning graphs made by IBM. Image credit: IBM

Who’s doing it?

PayPal. If you’re doing something unsavory (like dealing drugs), you’re probably trying to “get rid of” your money on paper while actually keeping it.

What type of data do you need?

I’m assuming you would need to build some training model to detect money laundering-like activities in a user’s transaction history—such as moving cash to and from PayPal, converting it to bitcoin, etc.

The data needed is usually a “forest” of data when it comes to fraud. Companies like ThreatMetrix tend to be so great because they collect a lot of data from different sources. Fraudsters tend to spoof or fake things locally but not globally. So, you can fool someone about an IP address, but maybe not about an IP address, locale, operating system, and other signals all at once.

This triangulation works the same way with money laundering. I might own a laundromat (which accepts only cash) that pays some other shell company, which pays another shell company, into perpetuity until investigators give up.

Usually, money laundering is more about effort than complexity. If PayPal can start to tie people together (even if the links are fuzzy), they can figure out that someone is doing something suspicious. For example, Coconut Grove, Florida, is a hotbed of fraud for some reason, so any transactions done in that location could be flagged.

Which algorithm?

I’d guess that it’s a mix of a graph algorithm with some classification. The degrees of separation between a bad actor and someone else are almost always a big deal in these investigations.

People tend to cover up one or two transactions, but not six. So I’d guess they’re using CatBoost or XGBoost, which are both boosting ensemble classifiers.

What’s the impact on the industry?

Money laundering is illegal, so stopping this is probably a good thing.

I’d guess that PayPal is trying to detect money laundering because 1) they don’t want to break the law, and 2) they are liable for the money if it’s laundered through them, so they’d have to give up all their fees.

Either way, it’s in their interest to not let this happen.


Stock Buying

Stock buyers love looking at trends to figure out when to buy and sell.

Who’s doing it?

Technical traders

What type of data do you need?

It’s all about the statistical moments of a time series! There is a whole lot of ritual around what technical traders do, whether it’s looking at moving averages, moving standard deviations, kurtosis, skew, or exponential and weighted moving averages. It’s a way of looking at what is normal and what is not. There are also some Bayesian methods some people use out there.

Which algorithm?

Technical traders generally use something called Bollinger Bands to determine whether a stock is trending high or trending low. Bollinger Bands use a sliding time window to compute a moving average plus a high and low range, typically set two standard deviations away.
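
Bollinger Bands are easy to compute with pandas; here’s a sketch on simulated prices (swap in real quotes from your data vendor):

```python
import numpy as np
import pandas as pd

# Simulated daily closing prices (a random walk), stand-in for real quotes.
rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(0, 1, 250).cumsum())

window = 20
middle = close.rolling(window).mean()              # the moving average
std = close.rolling(window).std()
upper, lower = middle + 2 * std, middle - 2 * std  # the Bollinger Bands

# A close above the upper band is "trending high"; below the lower, "low".
signal = np.where(close > upper, "high",
                  np.where(close < lower, "low", "normal"))
print(pd.Series(signal).value_counts())
```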

What’s the impact on the industry?

If you were able to time the market correctly, you could probably turn a dollar into billions of dollars. There are so many fluctuations in the market that if you could pinpoint exactly when to sell high and buy low, you would become a wealthy person.

But that’s not possible yet, so technical traders use these techniques to determine whether a stock is a good buy based on the statistical moments of a time series. Technical signals help with trading FOREX (foreign exchange) as well as other things like equity markets.


Credit scoring

How to interpret an individual’s FICO credit score.

Who’s doing it?

FICO

What type of data do you need?

FICO has always taken a scorecard approach to risk modeling: a bunch of heuristics are added up to come up with a “score” for someone’s creditworthiness. Recently, though, they have been taking more of a machine-learning approach, based on actual defaults instead of heuristics.

Which algorithm?

Most likely, FICO is using some regression on the probability of default. I’m not sure what models they are using, but probably some logistic regression with some transformations.

What’s the impact on the industry?

Honestly, scorecards are simple to hack – and hacking credit scores isn’t a good thing in the end.

Law enforcement and Government

The government is notorious for having outdated technology and limited funds to update it. Still, lately, I’ve been impressed with the way they’ve applied machine learning to, of course, save money.

The reality is that regardless of industry, human labor is costly. So even the government wants to get in on the machine learning action.

Here are a few of the ones I’ve seen as of late.


Facial Detection

How a picture of a face gets analyzed for recognition using machine learning.

Who’s doing it?

Immigration and Customs Enforcement (ICE) is using Amazon’s Rekognition; there is also talk about using facial detection to prevent school shootings.

What type of data do you need?

Many photos with faces in them. Whether it’s the COCO dataset, the ImageNet dataset, or any of the others, there are ample datasets containing images with faces. Fundamentally, we need photos where the faces have been labeled (with coordinates).

Which algorithm?

Algorithm choice is a hot topic.

Years ago, many tried Haar wavelet transforms, which worked surprisingly well. Now we are seeing a renaissance with deep learning, where many are utilizing single-shot detection through YOLO, S3FD, and R-CNNs. Once you have a face, you then have to play it against a database of other faces to see if there’s a match.

Most likely, you run the detected face through a convolutional neural net to extract features and then match those features against a database of faces to see if there is a close match. If there is, then ICE knows they want to target that person.

What’s the impact on the industry?

If ICE can make it work, it can save a lot of time in the long run (especially when you have 100,000 people crossing a border every day). Thanks to facial recognition, border officials can ask a computer, “Is this somebody we’re looking for?”

Don’t get me wrong—this isn’t comprehensive technology by any stretch of the imagination. According to Kori Hale, CEO of CultureBanx:

“Research shows commercial artificial intelligence systems tend to have higher error rates for women and black people. Some facial recognition systems would only confuse light-skinned men 0.8% of the time and would have an error rate of 34.7% for dark-skinned women. A glitch like this could lead to bias among minorities and immigrants.”

Is it still worth pursuing? It does narrow down your search, but as with any machine learning endeavor, there are always ethics to consider.


Using Data for Crime Prevention

How a police department might identify crime zones within a given region or area.

Who’s doing it?

Police departments

What type of data do you need?

Location data—to show where crime happens within a geographical location.

Which algorithm?

Most of these models are GIS-type models that try to classify bounding boxes on a map. If you’ve ever played SimCity, you’ve most likely seen a map showing the crime areas.

From there, we can build a more interactive model to see where we need to send patrol officers in real-time.

What’s the impact on the industry?

Like any government agency, police departments don’t have infinite human labor—often there are only a few cars patrolling a given area at a given time. So the allure is simple—utilize resources effectively.


Weather Forecasting

Predicted possible storm paths. Image credit: Cyclocane

Who’s doing it?

NOAA, national research labs

What type of data do you need?

Whatever gets collected at weather stations—temperature, etc.

So, if meteorologists were forecasting the weather for next week, they would look at the model from NOAA as well as the European models, getting multiple looks at the same set of data.

Which algorithm?

Because weather forecasting involves taking many models and aggregating them together, this is ensemble learning. Ensemble learning can take many forms, including bagging, boosting, and variants.

What’s the impact on the industry?

For farming, it’s essential to know the weather today, tomorrow, and as far out as ten days, because it helps with predicting crop yield.

For other things? Knowing the weather can help us predict disasters, navigate ships, chart flight plans, prepare for emergencies—the possibilities are endless.

Health Care

Thanks to advances in medicine and health care, humans are living longer. So it makes sense that, from a genetics or diagnostics viewpoint, machine learning can be used to detect disease, link relatives to one another, and more.

There are many up-and-coming applications of machine learning in health care, so I’ve only outlined a few I’ve come across in my reading.


Radiation Treatment for Cancer

How a doctor might map a patient’s radiation treatment plan. Image credit: Microsoft

Who’s doing it?

InnerEye and Microsoft Research.

What type of data do you need?

You would need a training corpus of images from patient MRI scans to detect what and where tumors are. InnerEye also takes hints from the user (a radiologist or oncologist) to map tumors.

Which algorithm?

Finding tumors is achieved using object detection. Most likely, they are using some R-CNN or YOLO classifier for this.

What’s the impact on the industry?

This one hits home for me because I had testicular cancer in 2015. I didn’t undergo radiation, but one of the most significant issues with the treatment is doing it in a way that kills the tumor while preserving the rest of the organs.

Thanks to InnerEye, doctors have some new techniques to target tumors while sparing non-cancerous cells when treating cancer patients with radiation.


23andMe Relatives

Image credit: 23andme

Who’s doing it?

23andme, as well as MyHeritage DNA, does a lot with DNA searching.

What type of data do you need?

You would need DNA samples from a group of people. For 23andme, you spit your saliva into a tube.

Which algorithm?

Most likely, they are doing this through some nearest-neighbors search. The genome is quite large and turns into a large vector. Using neighbors, you can start to see who matches and who doesn’t, using a bitmask.

Another thing they could be doing is using a ball tree, which could determine someone’s ancestry based on some archetypal DNA. So if 100 people from Britain all share the same segments and you match them closely, then most likely you are British.
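
Here’s a sketch of that nearest-neighbors idea with scikit-learn’s BallTree on toy genotype vectors; real genotype arrays are far larger, and 23andme’s actual pipeline is unknown:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
# Toy genotype vectors: 500 SNP sites encoded 0/1/2 (copies of minor allele).
# Real arrays have hundreds of thousands of sites.
genotypes = rng.integers(0, 3, size=(1000, 500)).astype(float)

tree = BallTree(genotypes, metric="manhattan")

# Find the 5 customers genetically closest to a new sample: likely relatives.
new_sample = rng.integers(0, 3, size=(1, 500)).astype(float)
distances, indices = tree.query(new_sample, k=5)
print(indices[0], distances[0])
```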

What’s the impact on the industry?

Now you can find long-lost relatives based on DNA matches, as well as get information on whether you have genetic markers for specific diseases.

For me, it was essential to understand my ancestry since my grandmother was adopted. But for other people, like my wife, it wasn’t super impactful to learn she’s 99% East Asian.


Cancer Detection (Glaucoma, Skin and Breast Cancers)

Teaching a computer how to analyze for benign vs. malignant tumors. Image credit: PeerJ

Who’s doing it?

Google, as well as Dr. Louis Pasquale at Mount Sinai Hospital, is using machine learning to detect glaucoma and skin cancers.

What type of data do you need?

The data needed is slides containing cells from tissue biopsies, which need to be labeled by an oncologist or radiologist.

Which algorithm?

Like in the radiation treatment example, this is achieved using object detection. Most likely, the researchers are using some R-CNN or YOLO classifier for this.

What’s the impact on the industry?

Instead of doctors spending hours poring over slides of tissue biopsies by hand, machine learning can detect cancerous cells, or at least give doctors a better idea of where to look.

Considering agreement in diagnosis for some forms of breast cancer and prostate cancer can be as low as 48%, this is probably a good thing for medical advancement.


Medical Data Visualization

Image credit: Optical Society of America

Who’s doing it?

SigTuple

What type of data do you need?

It depends on what type of test you’re looking at, but SigTuple helps analyze peripheral blood smears, urine microscopy, semen, fundus and OCT scans, and chest x-rays. So you’d probably need training data from patients who are undergoing those tests.

Health care is complicated because there are tons of training data available, but access is hard to come by due to HIPAA.

Which algorithm?

Most likely, they are using some computer vision techniques to sharpen the focus of these images.

From there, you could classify whether there is a problem with blood, urine, or semen with a deep learning net.

What’s the impact on the industry?

SigTuple does an excellent job on its homepage of describing its impact:

We dramatically improve the speed, accuracy, and consistency of several screening processes. Medical institutions can now serve more patients, with a significant reduction in human errors. Patients can now be more confident in the quality of healthcare delivery.

I couldn’t have said it better myself.

Manufacturing & Logistics

I consider these the “invisible processes” that happen every day yet go unnoticed by the average person. No one sees them, so they’re not as sexy an application as social media.

Looking at old processes is where I get excited about machine learning, though, because the possibilities are endless IF you can partner with someone with deep domain expertise.


“Defect” Detection With Images

How an orange gets analyzed for defects via machine learning. Image credit: Vietnam Journal of Computer Science

Who’s doing it?

Hyper AI, Garver, and others.

What type of data do you need?

Images. The general idea is that if hyperspectral, IR, or thermal cameras are used to analyze things like cracks in the pavement or a box of fruit, the system can detect defects more accurately than the human eye.

Which algorithm?

R-CNNs are most likely used, or perhaps YOLO or some other variant of deep learning. You would need region detection as well as object detection.

What’s the impact on the industry?

Major. Human labor jobs, especially inspection jobs, take up a lot of time and money and are (sometimes) low in accuracy. The computer can be “trained” to be more accurate than the human eye, or at least give a visual cue so humans know where to look.

For example, if you were a grocery store and the box of peaches you received is full of tarantulas, you’d probably want to know that before your employees opened the box!


Estimated delivery time

How Postmates fits a linear equation to a two dimensional set of points (multivariate linear regression).

Who’s doing it?

Postmates.

What type of data do you need?

Postmates used a pretty intuitive approach to regression with their data. They wanted to find the relationship between delivery time as it relates to:

  • Distance between the pickup location and the drop-off location
  • Distance between the Postmate and the pickup location
  • How long it takes the merchant to make an order
  • The kind of vehicle the Postmate is on, like a bicycle, car, motorcycle, helicopter, or on foot

Which algorithm?

Postmates outlined a very traditional linear regression model that utilized cleansed data and lots of quality features (distance, the average time to make an order, as well as specific features like the vehicle type).

What’s the impact on the industry?

Nothing earth-shattering, but it’s nice to know when your dinner is going to be delivered so you can plan ahead.

Estimation applies to problems other than meal delivery. I’d guess that companies would love to know when supplies are coming in so they can make better estimates about when something on a larger scale (say, an automobile) will be ready to release to the public.


Robotic Process Automation

Image credit: MIT

Who’s doing it?

FortressIQ

What type of data do you need?

I believe the way that FortressIQ is doing this is by creating screencasts of someone doing a repetitive task. Watching and learning is a variation on expert systems or mimicry: if you watch someone do something enough, you start to learn what they are doing and what the implicit goals are. AlphaGo and AlphaZero use something similar, where the rules of engagement are known but not how to get there. After many episodes of watching, the algorithm starts to piece things together.

Which algorithm?

Reinforcement learning and deep convolutional neural nets are an approach I would take to achieve something like this.

What’s the impact on the industry?

If you're a Fortune 500 company, there are bound to be a few "old school" processes that take up a ton of support time and drag down profitability. Thanks to machine learning, some of them can be automated and turned on and off with a switch.

Compare that to hiring someone to maintain it for you—a huge difference!


Manufacturing Improvements

An infographic of how AI can help automate factory processes.

Who’s doing it?

GE and others.

What types of data do you need?

Most of the time, manufacturers know when there is some defect or “chain pulling” event. The problem is piecing it all together into one cohesive stream.

What GE is doing is utilizing IoT devices that integrate throughout the entire build process to figure out when something is not acting normally. The training data is more or less previous defects that the manufacturer detected.

Which algorithm?

GE detects anomalies through a suite of tools called Predix, which supposedly uses deep learning techniques. Most likely, this means they are using a big neural network to classify events and extract features from millions of IoT sensor data points.
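
I don't know Predix's internals, but the general pattern is easy to sketch. Here's a toy version using an isolation forest instead of a deep net, trained on made-up temperature and vibration readings:

```python
# A hedged sketch of sensor-stream anomaly detection on invented data;
# real systems fuse many more sensors and far richer models.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated normal operation: [temperature C, vibration g]
normal = rng.normal(loc=[50.0, 0.2], scale=[2.0, 0.05], size=(1000, 2))
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# New readings from the line: the second one is far outside normal operation.
readings = np.array([[51.0, 0.22], [78.0, 0.90]])
print(model.predict(readings))  # 1 = normal, -1 = flag for inspection
```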

What’s the impact on the industry?

According to this article, it will “bring down labor costs, reduce product defects, shorten unplanned downtimes, improve transition times, and increase production speed.”

Pretty significant if you ask me.


Lean Shipping

How stores request products from vendors using lean shipping.

Who’s doing it?

Brick and mortar retail shops, like Lowe’s, Safeway, or others, are always looking for ways to reduce shipping costs and inventory. The goal is to only have on hand what customers demand.

What type of data do you need?

To do lean shipping, a company needs to know a few things: what customers will want, how much of it, and any other identifying factors (such as season). Over time, a company can learn its customers' purchasing patterns.

For instance, Lowe's might know that demand for A/C units starts to heat up around April or May, and ship those items to the store in March to anticipate it.

Which algorithm?

Most likely, these brick and mortar companies are balancing takt time against customer demand. Companies like Lowe's and Safeway are trying to find the equilibrium point between how much they can reliably ship and what customers actually demand.

Trying to predict what customers want tomorrow is a moving target, so this is where many regression techniques come into play.
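
To make both halves concrete, here's a toy sketch: the takt-time formula (plain arithmetic) plus a simple seasonal regression over invented A/C sales figures.

```python
# Takt time is a plain formula; demand forecasting is a regression.
# All numbers below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Takt time: how often a unit must move to exactly meet demand.
available_minutes_per_day = 8 * 60
daily_demand_units = 120
takt_minutes = available_minutes_per_day / daily_demand_units
print(f"takt time: {takt_minutes:.1f} minutes per unit")

# Toy seasonal regression: A/C units sold vs. month, with a sine term
# to capture the summer spike, so stock can arrive ahead of demand.
months = np.arange(1, 13).reshape(-1, 1)
units_sold = np.array([5, 6, 12, 30, 55, 80, 90, 85, 40, 15, 7, 5])
X = np.hstack([months, np.sin(months * np.pi / 6)])
model = LinearRegression().fit(X, units_sold)

april = np.array([[4, np.sin(4 * np.pi / 6)]])
print(f"rough April demand estimate: {model.predict(april)[0]:.0f} units")
```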

What’s the impact on the industry?

Real estate is expensive. If stores held ten days' worth of inventory at a time, they'd take up roughly three times the space and see far more spoilage.

Using this hub-and-spoke model, where a store holds only what it needs on any given day based purely on demand, is excellent for reducing prices and increasing efficiency across the system.

The common belief is that a grocery store holds only three days of supplies at any given moment. You could read that as "we're three days from impending doom." Or you could read it as having just enough in storage for what we need, when we need it: less spoilage and less stock sitting in the system.


Agribotix

How computer imaging can analyze plots of farmland for optimal crop growing.

Who’s doing it?

Agribotix

What type of data do you need?

Images from drones. A model trained on them can determine whether certain crops are growing correctly or not.

Which algorithm?

I believe Agribotix is probably using hyperspectral imaging and classifying features via the convolutions of a deep neural net, looking at near-infrared (NIR) and other infrared bands.
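
One concrete building block I'd expect in a pipeline like this is NDVI, the classic vegetation-health index computed from the NIR and red bands. Here's a minimal sketch with toy band values:

```python
# NDVI = (NIR - Red) / (NIR + Red). Healthy vegetation reflects NIR
# strongly, so values near 1 mean healthy crops; values near 0 or below
# suggest bare soil or plant stress. Band arrays are toy stand-ins for
# what a drone imaging pipeline would produce.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    return (nir - red) / (nir + red + 1e-9)  # epsilon avoids divide-by-zero

nir_band = np.array([[0.60, 0.55], [0.20, 0.58]])  # toy 2x2 "images"
red_band = np.array([[0.08, 0.10], [0.18, 0.09]])

stressed = ndvi(nir_band, red_band) < 0.4  # flag pixels worth scouting
print(stressed)
```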

What’s the impact on the industry?

If you’re an agricultural farmer, you don’t have a ton of free time on your hands. Rather than driving through thousands of acres manually, farmers can use machine learning to gain greater situational awareness of their crops.


Automated Cranes

An overhead crane.

Who’s doing it?

Konecranes is the leader in this realm.

What type of data do you need?

Most likely, Konecranes is using some version of computer vision classification, plus rule-based logic, to automate moving a load from one place to another. Some of this is very deductive, although testing over time is generally essential for QA.

Which algorithm?

I suspect that Konecranes is using some variation of programmed travel and some computer vision safety guards.

What’s the impact on the industry?

In the crane industry, safety matters most. Some cranes even handle things like spent fuel rods, where a mishap could result in a catastrophe the size of Chernobyl.

Even scarier? Most of the time, crane accidents are due to operator error. So, for our safety, it’s probably a good thing fully-automated cranes exist.


Emissions Reduction

Image credit: veeterzy on Unsplash

Who’s doing it?

Siemens, which manufactures industrial gas turbines, is currently trying to reduce their emissions with deep learning.

What type of data do you need?

The data needed here is collected emissions readings, like carbon monoxide, cross-referenced against other features used as inputs (RPM, gas mixtures, ignition schedules, temperature). All of these come from whatever sensors the engine's control computer supports.

Which algorithm?

Most likely, they used a feed-forward neural net with carefully engineered features, then, based on training data, searched for the combination of settings that produces the lowest emissions. They could also have used another algorithm, such as LIPO or a genetic algorithm.
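
Here's a toy sketch of that "learn a surrogate, then optimize it" idea. The feature names and response surface are invented, since Siemens' actual setup isn't public.

```python
# Train a neural net to map operating settings -> emissions, then search
# the learned surrogate for the lowest-emission settings. All data is fake.
import numpy as np
from scipy.optimize import minimize
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Historical operating points: [fuel/air ratio, inlet temperature], normalized
settings = rng.uniform(0, 1, size=(500, 2))
# Pretend emissions respond quadratically with an optimum inside the range.
emissions = ((settings - [0.3, 0.6]) ** 2).sum(axis=1) + rng.normal(0, 0.01, 500)

surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                         random_state=0).fit(settings, emissions)

# Minimize predicted emissions over the surrogate within known-safe bounds.
result = minimize(lambda x: surrogate.predict(x.reshape(1, -1))[0],
                  x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)])
print("lowest-emission settings:", result.x.round(2))  # should land near [0.30, 0.60]
```

The appeal of the surrogate approach is that you can search millions of candidate settings without burning a drop of fuel; only the winning candidates need to be validated on real hardware.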

What’s the impact on the industry?

Siemens was able to reduce emissions by 10-15% beyond what experts had already achieved by hand-tuning.

Even better? Learning the best tuning gives us the future possibility of improving emissions for cars and other “emitters.”

InfoSec, Data Security and Identity Protection

The more activities we do online, the more critical it is for companies to have security measures in place.

While brick and mortar stores may have a security team monitoring a suspicious customer’s every move, online stores have no way to verify whether a customer is who they say they are.

That’s where machine learning can help automate some parts of the process.


Verifying Identity at SheerID

It’s easy to “forget” you’re no longer a student, especially if you’ve been getting discounts from retailers because of it.

Who’s doing it?

SheerID, to verify who is eligible for a student discount.

What type of data do you need?

SheerID checks whether someone is fraudulent based on information previously collected by a task force of humans labeling data (much of the time, a request is legit). If you're starting from scratch, get a bunch of team members to label examples by hand and store those labels.

Which algorithm?

SheerID is a client, so I can't divulge too much information. But I can tell you we used supervised learning techniques – more specifically, a boosting algorithm.
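
Without divulging anything client-specific, here's a generic sketch of the pattern: a boosted classifier trained on human-labeled requests, with confidence thresholds deciding what still needs a human. All features and numbers are invented.

```python
# Generic boosted-classifier-plus-review-queue pattern (not SheerID's
# actual model or features).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(2000, 4))       # e.g., enrollment-record match scores
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # 1 = legitimate student (toy labels)
model = GradientBoostingClassifier().fit(X, y)

proba = model.predict_proba(X)[:, 1]
auto_approve = proba > 0.95
auto_reject = proba < 0.05
manual_review = ~(auto_approve | auto_reject)  # only the ambiguous middle
print(f"sent to humans: {manual_review.mean():.1%}")
```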

What’s the impact on the industry?

For SheerID? Huge. Before using machine learning, this was a manual review process that took hours to complete.

Now with machine learning, 95-96% of the incoming requests do not need to be manually touched, significantly improving their leverage.


Fraud detection

How fraudulent transactions can be analyzed using machine learning.

Who’s doing it?

ThreatMetrix and Sift.

What type of data do you need?

I believe these companies model fraud on a few factors:

  1. IP ranges (whether it’s on a VPN or ISP)
  2. Geolocation
  3. Browser / OS combo
  4. Browsing patterns

Which algorithm?

There are many ways to detect fraud, but I suspect they're doing classification with CatBoost or XGBoost.
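
Here's a toy sketch of scoring transactions over those four feature families; the features and labels are invented stand-ins for real fraud signals.

```python
# Gradient-boosted fraud scoring over invented transaction features.
import pandas as pd
from xgboost import XGBClassifier

tx = pd.DataFrame({
    "ip_is_vpn":        [0, 1, 0, 1, 0, 1],
    "geo_mismatch_km":  [3, 4200, 10, 3800, 1, 5000],  # billing vs. IP location
    "rare_browser_os":  [0, 1, 0, 1, 0, 1],            # unusual browser/OS combo
    "pages_before_buy": [14, 1, 9, 2, 20, 1],          # browsing pattern
})
label = [0, 1, 0, 1, 0, 1]  # 1 = confirmed fraud (ended in a chargeback)

model = XGBClassifier(n_estimators=50, max_depth=3).fit(tx, label)
print(model.predict_proba(tx)[:, 1].round(2))  # fraud probability per order
```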

What’s the impact on the industry?

Fraud is a big deal. If someone buys a $20,000 diamond ring on your website, you want to be sure they will pay for what they say they will.

After all, most of the time, credit card companies will charge the fraudulent purchase back to the merchant. So it's in every e-commerce company's interest to make sure they don't lose money to fraud.


Distributed Denial of Service (DDoS) Attack Prevention

A visualization of a DDoS attack.

Who’s doing it?

Cloudflare

What type of data do you need?

Information from previous network attacks.

Which algorithm?

One of the most challenging things about DDoS is that the attack comes in from many machines at once. The way DDoS works is by utilizing infected computers ("zombies"). In terms of features, it's hard to detect who is a zombie and who isn't, so most likely the best way to figure this out would be XGBoost or CatBoost fed with some precise information, like the OS version and the timing of requests.

What’s the impact on the industry?

DDoS attacks can take down a site, which can cost a company thousands of dollars, but blocking all incoming traffic isn't an option either, because turning away legitimate visitors costs just as much.

That said, having mitigating techniques to detect DDoS attacks is pretty essential.


Antivirus and malware detection

A common antivirus message

Who’s doing it?

Malwarebytes, McAfee, Norton, and Panda all address some variation of this problem.

What type of data do you need?

Examples of malware to detect. Companies like these generally have many honeypot machines that collect tons of malware to analyze.

Which algorithm?

There are many ways to slice security analysis. It's something I would love to know more about, but it's an incredibly difficult subject to get right, so I'm not going to act like an expert here.

As it relates to machine learning, the idea is to flag software whose behavior looks malicious with high enough probability. One of the biggest problems with the old way of detecting threats was that it relied on hand-coded signatures for particular types of code execution. Machines can instead learn what viruses and malware act like and recognize them based on training data. Many practitioners use random forests to classify malware.
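
A minimal sketch of that approach, with invented behavioral features standing in for what a sandbox or honeypot might log:

```python
# Behavior-based malware classification with a random forest.
# Each row: [writes to system dir, modifies registry run keys,
#            outbound sockets opened, child processes spawned]
from sklearn.ensemble import RandomForestClassifier

behaviors = [
    [0, 0, 1, 1],   # ordinary browser
    [1, 1, 1, 5],   # installer (legitimate but noisy)
    [1, 1, 9, 20],  # worm-like sample from a honeypot
    [0, 1, 8, 15],  # trojan-like sample
]
is_malware = [0, 0, 1, 1]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(behaviors, is_malware)
suspect = [[1, 1, 7, 12]]
print(clf.predict_proba(suspect))  # probability the new binary is malicious
```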

What’s the impact on the industry?

Well, no one wants malware to turn their computer into a botnet zombie, or, even worse, to take over critical infrastructure the way Stuxnet hit Iran's uranium enrichment facility.

So it’s probably a good thing these companies are analyzing the behaviors of software to figure out what’s indicative of a virus.

Social Media

I like to refer to social media as the “sexy industry application” of machine learning. There’s no shortage of data to analyze, and it helps B2C companies fine-tune their advertising.

I limit my activities on social media, so I’m forever fascinated by how machine learning can detect and predict attributes of an individual, even in ways bordering on unethical.


Lookalike Audiences

Lookalike clusters

Who’s doing it?

Facebook.

What type of data do you need?

Network data. In this case, people on Facebook fill out a profile.

Which algorithm?

Facebook gives you a threshold of similarity. Most likely, they compute this as distance from the group's mean; attaching a user to a group this way is very similar to K-Means clustering.

They probably aren't using K-Means directly, but it's a good analogy for what they're doing: given a query audience, you want to find the people closest to its centroid, and you can loosen the threshold to pull in people further and further away.
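
A toy version of the centroid idea (real lookalike systems are far more sophisticated than this):

```python
# Rank everyone by distance to the seed audience's mean profile vector.
import numpy as np

seed_audience = np.array([[0.90, 0.80, 0.10],   # rows: users
                          [0.80, 0.90, 0.20]])  # cols: interest scores (ML, tech, sports)
everyone_else = np.array([[0.85, 0.75, 0.15],
                          [0.10, 0.20, 0.95],
                          [0.70, 0.90, 0.30]])

centroid = seed_audience.mean(axis=0)
distances = np.linalg.norm(everyone_else - centroid, axis=1)

threshold = 0.5  # the "similarity" dial: raise it for a bigger, looser audience
lookalikes = everyone_else[distances < threshold]
print(f"{len(lookalikes)} lookalike(s) within {threshold} of the centroid")
```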

What’s the impact on the industry?

For me, it's great to know exactly who I'm marketing to.

If I have a list of 300 people who love machine learning, I can load that into Facebook and say, "give me a bigger audience similar to this one," without actively going out to find thousands more people. FB already has the network to query.


Social Influencers and Brand Matching

Who’s doing it?

Snapppt.

What type of data do you need?

In this case, Instagram data — profile information, photos, comments, etc.

Which algorithm?

My client asked me to do this, and what we used was an InceptionV3 architecture with fine-tuning at the end.
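
Here's roughly what that setup looks like in Keras. The dataset is hypothetical, and the binary head (brand vs. creator) is my illustration of the task rather than the exact architecture we shipped.

```python
# Fine-tuning sketch: freeze the pretrained InceptionV3 base and train
# a small new classification head on labeled profile photos.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # train only the new head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # brand vs. creator
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_ds would be a tf.data.Dataset of (image, label) pairs built from
# labeled Instagram profiles; hypothetical here.
# model.fit(train_ds, epochs=5)
```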

What’s the impact on the industry?

Knowing who is a brand and who is a creator is impactful because brands want to target specific influencers (i.e., creators).

For instance, the outdoor brand Patagonia might be searching for climbers, whose content relates to things like rappelling and the outdoors. Patagonia wants to be able to search through Instagram profiles quickly.


People Recommendations

Connections you could make on LinkedIn

Who’s doing it?

Any social network, but I'm thinking of the connection-recommendation feature on LinkedIn.

What type of data do you need?

You need people in a social network that are “connected” with each other.

Which algorithm?

Once you have a social network of people, you can start to reduce the dimensions using something like PCA or ICA and match based on distance with something like K-Nearest Neighbors.
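
A minimal sketch of that pipeline on a toy five-person network:

```python
# Compress each user's connection vector with PCA, then rank everyone
# by distance in the reduced space and suggest non-connections.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# adjacency[i][j] = 1 if users i and j are connected
adjacency = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

embeddings = PCA(n_components=2).fit_transform(adjacency)
knn = NearestNeighbors(n_neighbors=len(adjacency)).fit(embeddings)
_, idx = knn.kneighbors(embeddings[[0]])  # everyone, ranked by similarity to user 0

# Suggest the closest users that user 0 isn't already connected to.
suggestions = [int(j) for j in idx[0] if j != 0 and adjacency[0, j] == 0]
print("people you may know:", suggestions)
```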

What’s the impact on the industry?

The more people there are in a social network, the richer the community is. Users can "discover" people who share their interests and friends and build relationships from there.


Newsfeed Generation

Facebook Newsfeed

Who’s doing it?

Most famously, Facebook is doing it, but any content delivery network could use the same concept.

What type of data do you need?

A history of what users have clicked on and interacted with (versus what they scrolled past) is probably the dataset.

Which algorithm?

Facebook's Newsfeed algorithm is a mystery, but it is still fundamentally a sorting algorithm with predictions behind each of the values. For each piece of content within your "network," it probably scores how relevant that content is to you.

After assigning a value to each post or object, you can then sort the values and present them in the form of a newsfeed. My guess is that values are ranked based on relevancy, the likelihood of interaction, and some other trade-secret heuristics.
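
A toy version of "score each post, then sort"; the scoring model and weights here are invented, not Facebook's:

```python
# Predict a value per candidate post, then sort descending into a feed.
candidate_posts = [
    {"id": "a", "p_click": 0.30, "p_comment": 0.05, "recency": 0.9},
    {"id": "b", "p_click": 0.10, "p_comment": 0.20, "recency": 0.5},
    {"id": "c", "p_click": 0.60, "p_comment": 0.01, "recency": 0.2},
]

def score(post):
    # Weighted blend of predicted interactions plus a freshness boost.
    return 1.0 * post["p_click"] + 3.0 * post["p_comment"] + 0.5 * post["recency"]

feed = sorted(candidate_posts, key=score, reverse=True)
print([p["id"] for p in feed])
```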

What’s the impact on the industry?

The goal of any social network is to get more people to spend time on it (and click on ads). If Facebook shows you content that you want to see, you’re more likely to continue engaging with the app.

The pessimist in me says there’s a negative impact as well, as users are now addicted to Facebook.


Ad optimization

N-armed bandits

Who’s doing it?

Google and Facebook.

What type of data do you need?

One way to collect data is to show an ad to the user and then ask if the ad is relevant to them. Another way is to display advertisements and see what they click or open.

Which algorithm?

Google and Facebook have particular algorithms designed for their platforms, but the general problem they are solving is the N-armed bandit problem: for N slot machines, which arm should you pull, and at what time?

The N-armed bandit problem has been solved many different ways, with a Gittins index or the like. You want to serve up ads that users are likely to click on, but also give them an element of surprise.
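
Here's a minimal epsilon-greedy sketch of the bandit framing, with simulated click-through rates (production systems use fancier policies):

```python
# Epsilon-greedy: mostly serve the best-performing ad, occasionally explore.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.02, 0.05, 0.03]  # hidden click-through rate per ad (simulated)
clicks = np.zeros(3)
shows = np.zeros(3)
epsilon = 0.1                  # 10% of impressions explore

for _ in range(10000):
    if rng.random() < epsilon or shows.min() == 0:
        ad = rng.integers(3)                 # explore: pick a random ad
    else:
        ad = int(np.argmax(clicks / shows))  # exploit: best observed CTR
    shows[ad] += 1
    clicks[ad] += rng.random() < true_ctr[ad]

print("estimated CTRs:", (clicks / shows).round(3))  # ad 1 should win most impressions
```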

What’s the impact on the industry?

Facebook and Google make most of their money off of ads. If customers are paying for ads, those companies want to be sure that potential customers are going to click on them so that their customers are happy.

Plus, there's nothing more annoying as a user than being shown an ad for something utterly irrelevant to your interests, so it's a win-win for both sides.


Geolocation

Strava’s Geolocation

Who’s doing it?

Strava, to the detriment of the U.S. military.

What type of data do you need?

In this case, users were tracking their workout patterns — and Strava was keeping track of all of their geolocation data.

Which algorithm?

Not much ML is needed here. You need a way to see where GPS coordinates pile up in a specific place, and finding those hot spots can be done with an unsupervised algorithm: clustering is the right tool for the job.
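
A minimal sketch with DBSCAN and a haversine metric, on made-up coordinates:

```python
# Dense clusters of GPS points pop out as "hot spots"; sparse points are noise.
import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([
    [34.520, 69.180], [34.521, 69.181], [34.519, 69.179],  # dense activity cluster
    [40.712, -74.006],                                     # lone point elsewhere
])
coords_rad = np.radians(coords_deg)  # the haversine metric expects radians

earth_radius_km = 6371.0
eps_km = 0.5  # points within ~500 m count as neighbors
labels = DBSCAN(eps=eps_km / earth_radius_km, min_samples=3,
                metric="haversine").fit_predict(coords_rad)
print(labels)  # cluster 0 is the hot spot; -1 marks noise
```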

What’s the impact on the industry?

In this case, it isn't great, because now anyone can figure out where military bases are (especially in conflict zones like Afghanistan).

Of course, you can use geolocation for other positive impacts — like knowing where friends are at any given moment — but there’s nothing wrong with making a phone call.


Photo Aggregation

Yelp photo aggregation

Who’s doing it?

Yelp.

What type of data do you need?

Photos, and, most likely, comments associated with those photos make up a reasonable basis for a dataset.

Which algorithm?

Deep learning. Here's how Yelp did it, although that's just one approach.

What’s the impact on the industry?

If I'm looking for details on a restaurant near me and find 123 user-generated photos, that's a lot to scroll through when all I want to see is what a dish looks like. By separating the pictures into categories, I can jump straight to the ones that show food, cutting my search time in half.


Did I miss anything?

If you know of a machine learning application I didn’t share, leave it in the comments below.

And, of course, if you like what you’ve read here, others are sure to feel the same way, so share this post by tweeting it out or posting it on Facebook. Thank you!

-Matt
